DOCUMENT RESUME 



ED 041 856          24          SP 004 122

AUTHOR        Abramson, Theodore
TITLE         Development of Improved Techniques for Establishing
              the Reliability of Observation Ratings. Final Report.
INSTITUTION   City Univ. of New York, N.Y.
SPONS AGENCY  Office of Education (DHEW), Washington, D.C. Bureau
              of Research.
BUREAU NO     BR-9-B-070
PUB DATE      Jan 70
GRANT         OEG-2-9-400070-1039(010)
NOTE          83p.



EDRS PRICE EDRS Price MF-$0.50 HC-$4.25

DESCRIPTORS Analysis of Variance, *Classroom Observation
Techniques, Measurement Techniques, *Observation,
Reliability, Statistical Analysis, *Student
Behavior, *Teacher Behavior



ABSTRACT

This investigation sought to develop and apply 
analysis of variance techniques to the estimation of the reliability 
of observation schedules. It placed special emphasis on the different 
possible designs and the various administrative situations in which 
they might be applied. The application of the general model to a 
specific instance was then carried out. The study was conducted with 
ten recorders who observed five teachers through a one-way mirror and 
rated them on an observational schedule. The SUTEC (School University 
Teacher Education Center) Observational Schedule was used. (A copy is 
attached to the document.) Items observed included teacher mobility, 
the involvement of the children, materials present, materials in use, 
irrelevant behavior by the children, directed behavior, and 
spontaneous behavior. Analysis of variance was applied to the data 
obtained. The major conclusions were that different variance
component models could be applied in different situations to estimate 
the reliability of either the entire observation schedule or parts of 
it, and that the items comprising the SUTEC schedule did 
differentiate fairly well between teachers. (Author/MBM) 







FINAL REPORT 
Project No. 9-B-070
Grant No. OEG-2-9-400070-1039(010)

U.S. DEPARTMENT OF HEALTH, EDUCATION
& WELFARE
OFFICE OF EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM THE PERSON OR
ORGANIZATION ORIGINATING IT. POINTS OF
VIEW OR OPINIONS STATED DO NOT NECESSARILY
REPRESENT OFFICIAL OFFICE OF EDUCATION
POSITION OR POLICY.

DEVELOPMENT OF IMPROVED TECHNIQUES FOR ESTABLISHING
THE RELIABILITY OF OBSERVATION RATINGS

Theodore Abramson 
School of Education 
Fordham University 
Lincoln Center Campus 
New York, N. Y. 10023 



January 1970 



U.S. DEPARTMENT OF

HEALTH, EDUCATION, AND WELFARE 

Office of Education 
Bureau of Research 




TABLE OF CONTENTS

SUMMARY

CHAPTER

I. INTRODUCTION
   Statement of the Problem
   Definition of Terms
   Significance of the Problem
   Limitations of the Study
   Review of Related Literature
      Traditional Methods of Calculating the Reliability of
         Observational Data
      Summary of Literature on Traditional Methods of
         Calculating Reliability of Observational Data
      Recent Methods of Calculating the Reliability of
         Observational Data
      Summary of Literature on Recent Methods of Calculating
         Reliability of Observational Data
      Summary of Related Literature

II. THE SUBJECTS, MATERIALS, AND PROCEDURES
   The Subjects
   The Materials
   The Procedures
   The Statistical Procedures
   The Variables and Designs

III. ANALYSIS OF THE RESULTS OF THE INVESTIGATION
   Reliability of Individual Items
   Reliability of the Entire Schedule

IV. CONCLUSIONS AND RECOMMENDATIONS
   Conclusions
   Recommendations

REFERENCES

APPENDIX



LIST OF TABLES

TABLE

1. Medley and Mitzel Reliability ANOVA

2. Medley and Mitzel Expanded Reliability ANOVA

3. Medley and Mitzel Estimation of Variance Components

4. Schematic Representation of a Three Factor Partially
   Nested Design

5. Schematic Representation of Two Factor Repeated Measures
   Design

6. Two Way ANOVA for Repeated Measures

7. Estimation of Variance Components for a Two Factor
   Repeated Measures Design

8. Representation of Three Factor Repeated Measures Design
   with n Subscores




9. Three Factor ANOVA with Repeated Measures on One Factor

10. Estimation of Variance Components for a Three Factor
    Design with Repeated Measures on One Factor

11. Schematic Representation of Three Factor Design with Two
    Repeated Measures and n Subscores

12. Three Factor ANOVA with Repeated Measures on Two Factors

13. Expected Mean Squares of Three Factor Design with
    Repeated Measures on Two Factors

14. Estimation of Variance Components for a Three Factor
    Design with Repeated Measures on Two Factors

15. Four Factor Design with Two Repeated Measures

16. Four Random Factors Design with Two Repeated Measures

17. Degrees of Freedom for a Four Factor Design with Two
    Repeated Measures

18. Variance Components for a Four Factor Design with Two
    Repeated Measures

19. Analysis of Variance of the SUTEC Observation Team Data






20. Estimation of Variance Components of the SUTEC
    Observation Team Data

21. ANOVA for the Four Observers Present During Both
    Observations

22. Estimation of Variance Components for the Four Observers
    Present During Both Visits

23. Variances and Correlations for the Entire Observation
    Team and the Four Observers Present During Both
    Observations

24. Sources of Variation, Degrees of Freedom, and Expected
    Mean Squares for an ANOVA Design with Factor B Nested
    under Factor A

25. ANOVA Design with Factor B Nested under Factor A, All
    Factors Random, and n = 1

26. Analysis of Variance of an Observation Schedule
    Containing Seven Items and Using Three Observer Teams
    and Three Teachers

27. Estimation of Variance Components for an Observation
    Schedule Containing Seven Items and Using Three Observer
    Teams and Three Teachers








Acknowledgments 



Most of this report resulted from the author’s dissertation in 
the School of Education at Fordham University. The work was 
conducted under the mentorship of Dr. Frances J. Crowley, whose
help and guidance are gratefully acknowledged. 






Summary



Innovations in teacher training are often dependent on
observational data. The problem of measuring the reliability
of observations collected by a team arises from the
difficulty of (1) maintaining an observer team intact over
an extended period of time and (2) observing each teacher
more than once. These two conditions are normally required
if one is to apply the Analysis of Variance (ANOVA) model
proposed by Medley and Mitzel in 1963.

This study presented a number of different ANOVA models and 
the administrative conditions under which they were to be 
applied such that the partitioning of the sources of 
variation and the calculation of reliability coefficients 
could be carried out. Specifically, a model was designed
for the observer team situation in which the team
visited a number of different teachers only once and where
the team did not necessarily contain the same members for
all visits. The paradigm was developed for situations in
which there were n observations per item per observer and
also for situations in which there was only one observation
per item per observer.

The model was applied to data collected by teams of 
observers from the use of an observation schedule of 
teacher and pupil classroom situations and behaviors. The 
schedule items were teacher mobility, involvement of
children, materials present, materials in use, directed 
behavior, spontaneous behavior, and irrelevant acts. 

The reliabilities of the mobility, involvement of children,
and irrelevant acts items were .72, .67, and .69, respectively.
The overall reliability coefficient of .37 and the
variance components of .38, .18, and .07 for the items,
interaction, and error terms, respectively, indicated that
the teacher and item factors accounted for 75% of the
total variance.
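
The 75% figure follows from simple addition if the four reported
values are read as proportions of the total variance. The sketch
below assumes, as an interpretation that the Summary does not state
explicitly, that the overall reliability coefficient of .37 is the
teacher share of the variance. The dictionary keys and the function
name are illustrative only.

```python
# Values reported in the Summary. Reading .37 as the teacher component
# is an assumption made for this illustration.
components = {
    "teachers": 0.37,
    "items": 0.38,
    "interaction": 0.18,
    "error": 0.07,
}


def variance_shares(parts):
    """Return each component as a percentage of the summed components."""
    total = sum(parts.values())
    return {name: 100.0 * value / total for name, value in parts.items()}


shares = variance_shares(components)
teacher_plus_item = shares["teachers"] + shares["items"]
print(round(teacher_plus_item))  # teacher and item shares combined, in percent
```

Under this reading the four values sum to 1.00, so the teacher and
item shares together come to 75% and the interaction and error terms
account for the remaining 25%.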

Future research which would field test and compare
different administrative situations and their respective
reliability coefficients, calculated from the appropriate
designs, was recommended.



CHAPTER I 



Introduction 

Teaching is often considered an applied science or art. 
This concept would lead one to expect that a good deal 
of research on teaching has been done by observing the 
classroom where the underlying educational principles 
are actually applied. Medley and Mitzel, in referring 
to the paucity of such research, stated: 

Certainly there is no more obvious approach
to research on teaching than direct
observation of the behavior of teachers
while they teach and pupils while they
learn. Yet it is a rare study indeed that
includes any formal observation at all
(Medley & Mitzel, 1963, p. 247).

In recent years a number of different classroom 
observation schedules which permit the classification 
of teacher and pupil behaviors into a variety of category 
schemes have been formulated. Although not explicitly 
stated in most instances by their originators, the 
underlying rationale for many observation schedules 
was that given by Soar, who stated, ". . . it is possible
to identify and measure a common core of teacher-pupil
classroom behaviors which are basic to most, if
not all (important) aspects of pupil intellectual,
personal, and social growth" (Soar, 1966, p. 2).

Medley (1967) indicated that the theoretical formulation 
behind the construction of the Observation Schedule 
and Record (OScAR) consisted of the relationship between 
three levels of teacher behavior and effectiveness. 

The levels consisted of the variables related to 
classroom climate, the conducting of learning 
experiences, and the maintaining of pupil involvement. 

The advent of these schedules has actually made possible
the increased application of research technology to a
wide variety of educational problems such as school
program evaluation and teacher preparation program
evaluation. The schedules have also facilitated the
development of theories of teaching. Certainly, data
drawn directly from actual classroom behavior provide
a more adequate sample of the teaching-learning situation
from which inferences can be drawn on the worth of a
program than do such ad hoc factors as pupil
achievement or attitude toward the "new" program.



Observational data, unfortunately, besides being expensive 
to obtain, is no more precise an index of actual behavior 
than the team’s ability to observe and classify accurately 
that which transpires in the classroom. Therefore, what
is required is not merely a well trained team, but a team
whose members see and report the same things with accuracy
and consistency so that in effect the data reported by
different members of the team are comparable. To insure
maximum comparability and minimum variation of data collected
by the members of the observation team a schedule is usually
devised. The schedule, by listing the cues to be responded
to, helps to minimize the observer error. Thus the
usefulness of observational data is to a great extent 
dependent on what has been called inter-observer agreement 
or reliability. 

In the past the reliability of observation schedules has 
usually been examined through correlation analyses which 
yielded a measure of inter-observer agreement between only 
two observers. During the five years from 1958 to 1963 an 
analysis of variance (ANOVA) technique (Medley & Mitzel,
1958a, 1963) using a factorial design was developed which 
permitted the variance to be partitioned into its component 
parts and the calculation of an overall reliability 
coefficient. The application of this model required that the 
same observers visit the same teachers a number of times. 
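
As a rough illustration of what partitioning the variance involves,
the sketch below estimates variance components for a plain two-way
crossed layout in which every observer rates every teacher once. It
is a generic textbook layout chosen for brevity, not the specific
Medley and Mitzel design; the function name, the returned keys, and
the ratings are all hypothetical.

```python
def two_way_components(scores):
    """Estimate variance components from scores[teacher][observer].

    Assumes one rating per cell and treats teachers and observers as
    random factors. Negative estimates are truncated to zero.
    """
    n_t = len(scores)
    n_o = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_t * n_o)
    t_means = [sum(row) / n_o for row in scores]
    o_means = [sum(scores[t][o] for t in range(n_t)) / n_t for o in range(n_o)]

    # Sums of squares for teachers, observers, and the residual.
    ss_t = n_o * sum((m - grand) ** 2 for m in t_means)
    ss_o = n_t * sum((m - grand) ** 2 for m in o_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_t - ss_o

    ms_t = ss_t / (n_t - 1)
    ms_o = ss_o / (n_o - 1)
    ms_res = ss_res / ((n_t - 1) * (n_o - 1))

    # Solve the expected mean squares for the components.
    var_t = max((ms_t - ms_res) / n_o, 0.0)
    var_o = max((ms_o - ms_res) / n_t, 0.0)
    var_res = max(ms_res, 0.0)
    total = var_t + var_o + var_res
    reliability = var_t / total if total > 0 else 0.0
    return {"teachers": var_t, "observers": var_o,
            "residual": var_res, "reliability": reliability}


# Invented ratings: 4 teachers (rows) rated by 3 observers (columns).
ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2]]
result = two_way_components(ratings)
print({name: round(value, 3) for name, value in result.items()})
```

The reliability returned here is the teacher component divided by the
sum of all components, that is, the reliability of a single
observer's rating. Other definitions, such as the reliability of the
team mean, divide the observer and residual components by the number
of observers before forming the ratio.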

The logistical problems involved in applying this ANOVA model 
have made it administratively unfeasible and to date very 
little use has been made of this method of calculating 
reliabilities. What was required in order to make the ANOVA 
technique more applicable? What assumptions and restrictions 
would have to be imposed if the technique were to be 
statistically valid? These were some of the basic questions 
which this study addressed itself to and sought to answer. 

Statement of the Problem 

The purpose of the study was to investigate the conditions
under which an ANOVA model or models could be applied to the
calculation of an overall reliability coefficient for an
observational schedule, and to the partitioning of the sources
of variation into their component parts, without requiring the
same administratively unfeasible conditions as the original model.
The general models were then applied to the data obtained on 
the School University Teacher Education Center (SUTEC) 
Observation Schedule. The purpose of this application was to 
make explicit the steps that were required in order to apply 
the more general model to a specific instance. 

To this end, the study investigated and endeavored to answer 
the following questions: 






1. Which variables and interactions between variables
were to be expected when dealing with observation schedules
that were, and may continue to be, used to study classroom
behavior?

2. What was the difference between a "random" and a
"fixed" factor as far as the model was concerned? Which of
the variables identified in question 1 were "fixed" and which
were "random"?

3. Was it possible to consider the original factorial
model (Medley & Mitzel, 1958a, 1963) as a repeated measures
design? If so, what difference would this make in the
general analysis?

4. Under what conditions could designs different from
the original design be developed and applied to make them
more administratively feasible?

5. What was the relationship between analysis of
variance and the reliability of an observational schedule
defined in terms of the variables and sources of variation
inherent in an observation schedule?

6. What assumptions had to be made to make possible
the application of the general model or models to the
specific observation schedule data available, namely, the
SUTEC data?

7. In general, how can the reliability coefficient and
the variance components be used to estimate the percentages
of the variance attributable to the factors involved?

Definition of Terms 

A number of terms employed during this investigation required
definition. However, the more technical terms which pertained
to the ANOVA models such as "random," "fixed," "finite,"
"crossed," and "nested" factors are discussed and defined in
the section of Chapter II on Statistical Procedures. The more
general and less statistical terms that needed to be defined
were: observation schedule and observation team.

Observation Schedule. For the purposes of this investigation
the term observation schedule was defined as a series of
selected items that categorized and/or described those
classroom behaviors of teachers and students and/or settings
to which trained observers were directed to attend. The
items were typically formalized into a category scheme and
prepared in a form (list, grid, etc.) which permitted rapid
recording of observations.



Observation Team. For the purposes of this investigation
observation team was defined as the group of people who
were trained to collect classroom data through the use
of an observation schedule.

Significance of the Problem 

It has been indicated that an ANOVA model to calculate
reliabilities of observation schedule data has been
formulated. This theoretical approach, specifically
geared to the variables present in a classroom observation
situation, was first proposed, and subsequently further
developed, by Medley and Mitzel (1958a, 1963). However,
to date very little use of this ANOVA model has been
reported in the literature.

It is believed that the general lack of application of
this model in the past to studies involving observation
teams was in large measure due to the practical
difficulties involved in its application. Maintaining
an observation team intact over an extended period of
time and being permitted to visit the same classrooms
and teachers a number of times is difficult and rather
expensive. Both of these conditions, however, that is, using
only the same observers and the same teachers, are
necessary if one is to apply the previously discussed
ANOVA model. Practical research administration problems
usually force the researcher to train his team by having
all the members of the team visit a classroom together
after they have become somewhat familiar with the
observation schedule that they will use. The first visit
is typically followed by a group discussion and then
by a visit to another teacher. This procedure
is followed until there is a fair amount of agreement
between observers, at which point the members of the team
are sent out individually to observe the teachers who
are the Ss of the study. The presently available ANOVA
techniques do not apply to analyses of observer team data
under these frequently prevailing conditions.

This study therefore was devoted to the development and 
application of an ANOVA model or models applicable to 
conditions when the observer team did not necessarily 
have the same observers and when the observation of a 
given teacher did not necessarily occur many times. 

The assumptions and procedures necessary in order to
apply the general model to a specific case were also made
explicit. The development of the model makes possible the
broader and more precise use of observation team data in
a variety of educational problem situations and thus makes
feasible more accurate appraisals and evaluations than are
currently possible. At the same time, the availability
of another method of measuring the reliability of
observation data may aid in directing a greater research
effort to the place where the teaching-learning process
is carried on: the classroom.

Limitations of the Study 

This study was limited with respect to the following 
factors : 

1. Applicability. The inherent complexity of the
subject puts the reliability calculation beyond the
present training of many research workers, although the
administrative requirements have been simplified.
Therefore, the anticipated greater applicability of
observation data, in general, and the ANOVA model, in
particular, may not occur.

2. Validity of the Schedule. The models developed in
this study dealt only with the problem of the reliability
of observation schedules. At the same time the sources
of variation, expected mean squares, and the percent of
variance attributable to various sources were calculated.
However, the foregoing in no way answers questions
pertaining to the appropriateness, usefulness, validity,
etc. of the items comprising the schedule. Therefore,
the validity of the SUTEC Observation Schedule is still
highly questionable, and although the calculated
reliabilities may be numerically acceptable they may be
meaningless. This limitation is clearly a function of
the construct validity of specific schedules and is
directly related to the underlying rationale of each
schedule and its originator's philosophy or theoretical
framework, and is therefore beyond the scope of the
present investigation.

Review of Related Research 

During the last 20 years a number of observation schedules 
which purport to measure classroom behaviors and/or 
settings have been formulated and used. Many of the 
newer instruments such as the Observation Schedule and 
Record (OScAR) developed by Medley and Mitzel (1958b) 
may be considered as refinements or amalgamations of parts 
of previously proposed schedules. OScAR was actually
based on the category schedules of Withall and Cornell
(Medley & Mitzel, 1958b, 1963) and was supposed to measure
emotional climate, verbal emphasis, and social structure.

A thorough review of the many category schemes and their
uses for the period up to 1963 is available (Medley &
Mitzel, 1963) and need not be repeated.






In developing or using an observational schedule a problem
that must be faced is that of the reliability of the data.
The problem of the reliability of observation data has
usually been treated in terms of the per cent of observer
agreement or in terms of an interclass correlation between
two sets of observations. In the latter case a Pearson r
or a rank order coefficient has usually been used. Not
only was this so prior to 1963 (Medley & Mitzel, 1963), but
a review of the more current literature seemed to indicate
that this was still essentially true.
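
The interclass correlation between two sets of observations can be
sketched as below. The function is the standard Pearson product
moment formula; the two score lists are invented for illustration
and do not come from any study reviewed here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores given by two observers to the same five subjects.
# Observer B rates every subject exactly two points higher than observer A.
observer_a = [3, 5, 2, 4, 1]
observer_b = [score + 2 for score in observer_a]

r_ab = pearson_r(observer_a, observer_b)
print(r_ab)
```

Because the correlation ignores a constant difference between
observers, these two hypothetical observers correlate perfectly even
though they never assign the same score. This is one reason a
correlation and a per cent of agreement computed on the same data
can tell different stories.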

For the purposes of this review the studies dealing with
the reliability of observation data that were investigated
were grouped under two main headings: traditional
reliability calculations, and other methods of calculating
reliability. The reports that used the traditional methods
will be considered first and will be followed by the studies
which used methods other than per cent of observer agreement
and/or the Pearson Product Moment coefficient of correlation
or its equivalent parametric or nonparametric counterparts.
Studies which did not involve classroom situations were
included in the review since the techniques of measuring
reliability were the central issue rather than the content
or discipline of the application. This sectioning was done
to provide a frame of reference, and when a study fitted into
both categories this categorization was not strictly adhered
to.



Traditional Methods of Calculating the Reliability of
Observational Data



The review cited earlier (Medley & Mitzel, 1963) was not only
replete with the various observation instruments up to 1963
but also contained fairly complete information on the
reliability calculations of observation data up to that time.
Because of the great familiarity of many research workers
with the traditional concepts of reliability and the lack of
any additional contribution which might ensue from a second
review of the period up to 1963, this section was devoted
only to studies reported after 1963.

Ojemann and Snider (1964) reported on the development
and scoring of an observation form that was to evaluate
part of the Preventive Psychiatry Research Program of
the University of Iowa. The aim of the program that was
being evaluated dealt with the construction of curricular
materials that would help children acquire an
understanding and appreciation of the dynamic nature of
human behavior. That is, the materials were to help
[illegible] approach to their social environment.
[Illegible] program in "behavioral science" was constructed.

[Illegible] observing were trained during the spring
preceding the actual use. The training consisted of a
group discussion of the items comprising the observation
schedule, preliminary observation of three children by
each observer, and subsequent group discussion and
clarification of the items.

A study of the reliability of the instrument was carried
out. Two observers, I and II, carried out simultaneous
observations on 32 fourth grade Ss. Similar observations
were conducted by observers II and III on a class of 28
fifth grade Ss. The correlations between the behavior
scores of the two observers were .69 and .67 for the
fourth and fifth grade Ss, respectively.

[Author and date illegible] reported on the development
of a test of creativity in the dramatic arts. Part I
of the test was objective and purported to measure fluency
and redefinition. The S listened to a tape-recorded
story and was shown an [illegible]. The S then listed
the number of ways the material might be used in putting
on a play of the story. Fluency was determined by the
number of responses given, and redefinition was scored
as the number of [illegible] listed for the object.
A similar procedure was used for a piece of driftwood
and some mood music.

Part II required the S to write a short description of
how he would produce a scene from the story he had heard.
Part II was rated on a five point scale for originality
and sensitivity by three independent evaluators. Two
groups from sixth grade classes were matched on IQ,
reading achievement, and sex. The eleven and twelve year
olds [illegible] alternate forms of the test in the fall
and [illegible] of 1959. The reliability of Part I of the
test ranged from .38 to .67. The reliability of Part II,
which required value judgments on the part of the raters,
was given in terms of per cent agreement. Per cent of
agreement was given in terms of raters I and II, I and
III, [remainder illegible].



[Beginning of paragraph illegible] tool. He indicated that
this problem was essentially a question of the reliability
of the data. To test his [illegible] about interscorer
reliability and stability of data based on inferences he
studied the reliability of trained observers on samples of
simulated behaviors. The Ss were 64 high school seniors
[illegible] an assigned projective essay on the topic of
"A Teenager's Advice to the World" and an [illegible] to
the simultaneous presentation of Cards [illegible] of the
Thematic Apperception Test.

The three raters were trained in the use of the nine
[illegible] Factors Rating Scale, which contained the
following three questions:

1. How does this person see himself?

2. To what extent is this person identified with others?

3. To what extent is this person open to his experience?

Each observer's ratings were correlated by calculating
[illegible]. A month after the ratings were completed
each observer rescored a sample of 10 of the essays, and
a correlation between initial and final ratings was
calculated. The interscorer r's ranged from .72 to .84
and the stability coefficients ranged from [illegible].

[Author and date illegible] investigated the reliability
of adjective rating scales. A scaled expectation rating
scale was constructed [illegible]. First, a committee of
interviewers who were familiar with the job to be
performed established the traits [illegible] the job to
be performed. Second, examples of on-the-job behaviors
which illustrated high, average, and low degrees of the
trait were written. Third, the examples were reallocated
back into traits and levels by independent judges.
Fourth, only those examples with complete agreement as
to trait and level were retained. Finally, the remaining
examples were arranged on a continuous vertical scale
with each example at its proper scaled level for the
trait. An interview guide with the weights for each trait
at the back of the trait pages was prepared for use with
the scaled ratings.

[Illegible] compared the reliability of the traditional
adjective rating scale with the scaled expectation rating
method and found significant differences in favor of the
latter. During the first year of study 360 Cornell
University undergraduates were interviewed twice
[remainder of sentence illegible]. The questions on the
second interview differed slightly from those on the first
interview, and therefore the inter-interviewer reliability
actually consisted of inter-interviewer and S reliability.
The Pearson correlations were .35 for the trait scores, .34
for the overall rating, and .34 for the grand total score.
The following year, using the same procedures, 500
candidates were interviewed using the scaled expectation
rating technique. The correlations this time were .58,
.47, and .55. Subsequently, the scaled rating technique was
used by three interviewers, interviewing 188 and 172
candidates for female and male dorm counselors,
respectively. When interview guide and candidate
reliability were held constant, inter-interviewer
reliability was found to be .69 for the trait scores, .65
for the overall rating, and .72 for the grand total.

Bobbitt, Gordon, and Jensen (1966) studied the continued
inter-observer agreement of pairs of observers for a two
and one-half year period. The data were collected on three
groups of four mother-infant pairs of pigtail monkeys
for five random 10 minute samples of behavior per week.
For 26 weeks, a pair of observers simultaneously scored
the behavior of mother and infant for one of the 10
minute periods per week. Prior to this study the
observers were required to attain a .75 agreement
percentage criterion, where agreement percentage is equal
to the ratio of the responses agreed on to the total
number of responses. Following each observation an
agreement percentage was calculated and a discussion of
the observation was held. The dimensions measured were
position, posture, locomotion, visual, oral, and
manipulation. For all groups the total agreement
percentage was .79, with all dimensions ranging from .75
to .89 except for the visual dimension, which was .55.

Zunich (1966) studied the relationship of child behavior
and parental attitudes of 18 boys and 18 girls whose ages
in years and months ranged from 2-9 to 5-0. Direct
observation of the children, utilizing a time sampling
technique of five minute duration and predetermined
categories, was conducted through a one-way mirror. The
categories observed were: asking permission, contact,
cooperation, criticism, directing, indications of anxiety,
interference, non-cooperation, playing interactively,
praise, remaining out of contact, restricting, seeking
attention, seeking contact, seeking help, seeking
information, seeking praise, and suggests.

Reliability of the observations was calculated in terms of 
the percentage of agreement between two observers who 
recorded the behaviors of children who were not included 
in the study. Behavior was recorded simultaneously and
independently by two observers during 30 five-minute 
periods with an observation being made every five seconds. 






The number of agreements divided by 60, the total number
of observations for a five-minute period, was equal to
the percentage of agreement. For the 150 minutes of
observations, the 30 five-minute periods, the reliability
ranged from .83 to .97.
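
The agreement calculation described above can be sketched as
follows. The divisor of 60 comes from the text (one observation
every five seconds for five minutes); the function name and the two
observer records are hypothetical, with category labels borrowed
from the list of categories observed.

```python
def agreement_proportion(codes_a, codes_b):
    """Proportion of paired observations on which two observers agree."""
    if len(codes_a) != len(codes_b) or len(codes_a) == 0:
        raise ValueError("need two non-empty records of equal length")
    agreements = sum(1 for a, b in zip(codes_a, codes_b) if a == b)
    return agreements / len(codes_a)


# One observation every 5 seconds for 5 minutes.
OBSERVATIONS_PER_PERIOD = (5 * 60) // 5

# Invented records for a single five-minute period.
observer_1 = ["contact"] * 30 + ["praise"] * 20 + ["seeking help"] * 10
observer_2 = (["contact"] * 30 + ["praise"] * 20
              + ["criticism"] * 6 + ["seeking help"] * 4)

period_agreement = agreement_proportion(observer_1, observer_2)
print(OBSERVATIONS_PER_PERIOD, round(period_agreement, 2))
```

In this invented period the observers disagree on 6 of the 60
observations, giving an agreement proportion of .90, which would
fall inside the range of .83 to .97 reported for the 30 periods.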

Bloom and Wilensky (1967) constructed an observation scale
to measure the behavior of teachers. The scale was based
on a Skinnerian framework and contained the following
four categories: information giving, response elicitation,
feedback, and teacher control.

The Ss of the study were 72 underprivileged nursery 
children. Each observation lasted five minutes and was 
prorated if the activity ceased during the observation. 

For each of the observational categories, the inter-rater
reliabilities, based on 26 five-minute observational
periods, exceeded .90.

In the development of the Behavior Survey Instrument, an
observation sheet geared to Head Start and other special
programs in early childhood education, Katz, Peters, and
Stein (1968) used the agreement percentage as their
measure of reliability. An overall agreement percentage
of 84.6% was attained by simultaneous independent
observation of the same Ss and was based on the seven
categories, which were: task orientation, satisfaction,
motivation, cognitive, motility, interpersonal behavior,
and situation. The range for the categories was .64 to .98.

Brown, Mendenhall, and Beaver (1968) developed the Teacher 
Practices Observation Record (TPOR), which attempted to 
measure the agreement of teachers' observed classroom 
behavior with educational practices that were advocated 
by John Dewey. The TPOR had seven categories and contained 
a total of 62 items. The categories were: nature of 
the situation, nature of the problem, development of ideas, 
use of subject matter, evaluation, differentiation, and 
motivation control. Five filmed lessons were observed 
in 1964 by 130, 124, 119, 119, and 67 observers who 
received only a 10-minute explanation of the instrument. 

The observers were drawn from two large midwestern 
universities and east coast and west coast teacher 
training institutes. By occupation, the observer judges 
were college supervisors of student teaching, 
education professors, and academic professors. No 
significant differences were found between any of the 
groups on films 1, 2, 4, and 5. There was a significant 
difference on film 3 between supervisors of student 
teaching and both education and academic professors. 






In 1965 films two and four were observed once again by 69 
and 72 of the judges. Pearson coefficients for the 
observers' total scores within a given viewing and for 
the repeat viewings were calculated and ranged from .86 
to .93 and .27 to .65, respectively. Correlations for 
each 10-minute segment of each 30-minute film were also 
given and ranged from .32 to .71. 

Summary of Literature on Traditional Methods of Calculating 
Reliability of "Observational" Data 



In summarizing the research reported in this section, one 
must be mindful of the fact that each of the studies used 
a different research design. This was as it should be, 
since each investigation was essentially considering a 
different problem. There were differences in the 
instruments employed, the number of subjects who 
participated and the hypotheses being tested. 

With the exception of one study (Brown et al., 1968), 
which will be considered again in the next section of this 
report, all of the studies reviewed were concerned with 
the reliability of their observational data as a 
secondary problem. Because other problems were of primary 
importance, reliability considerations were often treated 
in a superficial fashion. These studies all had in common 
their traditional method of calculating reliabilities, i.e., 
percentage of agreement or Pearson r, or their equivalents. 
The question of whether or not these methods of reliability 
calculation were the best ones available, or should even 
have been employed, was for the most part ignored or at 
best cursorily treated. 

Recent Methods of Calculating the Reliability of 
Observational Data 



Scott (1955) developed an index of interscorer agreement, 
Pi, for nominal scale coding. That is, Pi was to measure 
interscorer agreement when the coding dimensions were not 
ordered along equal intervals or along a dimension of 
"more or less" of some attribute. The index, Pi, was to 
be used in survey and observational research where the 
typical procedure had usually called for one coder to 
analyze and code interview data. The data were then 
categorized by a second rater, and then a comparison 
between the two analyses was made. These analyses were 
followed by a conference between the two raters to enable 
them to arrive at their "best" judgment. 

The value of Pi was equal to the ratio of the difference 
between the percentage of actual agreement and the 
agreement expected on the basis of chance to the 
difference between the maximum possible agreement and the 
agreement expected on the basis of chance. 

Symbolically, 

$$\mathrm{Pi} = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ was the percentage of agreement between the two 
independent analysts, and $P_e$ was the percentage of agreement 
to be expected on the basis of chance. 

The expected per cent agreement for the dimension was 
equal to the sum of the squared proportions over all 
categories. 

Symbolically, 

$$P_e = \sum_{i=1}^{k} p_i^2$$

where k was the total number of categories and $p_i$ was the 
proportion of the entire sample falling into the i-th 
category. 

As an illustration of the method Scott (1955) calculated 
the value of Pi for the question "What sorts of problems 
are your friends and neighbors most concerned about these 
days?" 



Nature of Problem            Per Cent of All Responses 

Economic problems                      60% 
International problems                  5% 
Political problems                     10% 
Local problems                         20% 
Personal problems                       3% 
Not ascertained                         2% 



Therefore, 

$$P_e = (.60)^2 + (.05)^2 + (.10)^2 + (.20)^2 + (.03)^2 + (.02)^2 = .41$$

On the basis of an assumed 80% agreement between observers, 
the index of inter-coder agreement was: 

$$\mathrm{Pi} = (.80 - .41) / (1 - .41) = .67$$
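
The illustration can be checked with a short sketch. The function name is an assumption made for this example; the proportions and the assumed 80 per cent agreement are those of the illustration. Carried out without intermediate rounding, the arithmetic gives a value of about .66, marginally below the .67 reported in the text.

```python
def scotts_pi(observed_agreement, category_proportions):
    """Scott's Pi: agreement corrected for the agreement expected by chance."""
    if abs(sum(category_proportions) - 1.0) > 1e-9:
        raise ValueError("category proportions must sum to 1")
    # Expected chance agreement: sum of squared category proportions.
    expected = sum(p * p for p in category_proportions)
    return (observed_agreement - expected) / (1.0 - expected)


# Proportions of all responses in the six categories of the illustration.
proportions = [0.60, 0.05, 0.10, 0.20, 0.03, 0.02]
p_e = sum(p * p for p in proportions)
pi = scotts_pi(0.80, proportions)

print(round(p_e, 4))  # chance agreement before rounding to .41
print(round(pi, 2))
```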

In his discussion of observer reliability for his verbal 
category system, Flanders (1960) estimated interobserver 
agreement through an adaptation of the Scott (1955) 
coefficient, Pi. Rather than actually using the formula 
to calculate $P_e$ and Pi, he (Flanders, 1960) developed 
approximations to $P_e$ which were based on graphic estimates. 






The graphic estimate of $P_e$ was then followed by a graphic 
estimate of Pi. 

Two observers were trained to use the Flanders system 
which contained seven teacher, two pupil, and one general 
category as follows: teacher accepts feeling, teacher 
praises/encourages, teacher accepts/uses ideas, teacher 
asks questions, teacher lectures, teacher gives directions, 
teacher criticizes/justifies, student responds, student 
initiates, and silence/confusion. The proportion of 
tallies of the observers in each category was found and 
was used to calculate $P_e$. The value of $P_e$ was also 
estimated graphically for both observers. The values of 
Pi using the calculated and estimated values of $P_e$ were 
.855, .853, and .854, respectively. A critical ratio 
comparing these values was not carried out, although such 
a critical ratio calculation was possible (Scott, 1955), 
because it was obviously unnecessary. 

Furst and Amidon (1967) used Flanders' interaction analysis 
to investigate differences in interaction patterns between 
elementary school teachers of different subjects and of 
grades one through six. One hundred sixty classroom 
observations, one-third in "ghetto," one-third in suburban, 
and one-third in urban "middle" socioeconomic level 
schools, were carried out. There was a minimum of 25 
observations at each grade level with at least five 
observations in the areas of arithmetic, social studies, 
and reading at each grade level. 

The observer was trained and then practiced categorizing 
tape recordings of actual classroom sessions until a 
Scott coefficient of intraobserver consistency of .99 
was attained. The observer then observed three classroom 
situations with different trained observers present 
during these visits. The interobserver reliability 
coefficients were .90, .87, and .92 for the three 
simultaneous observations. 

The results showed that first, second, and sixth grade 
teachers did more talking in social studies than in other 
subject areas. Third, fourth, and fifth grade teachers 
did more talking in arithmetic. Student talk was lowest 
in grades one and two and highest in grades three, four, 
and five in social studies. 

Medley and Mitzel (1958a) developed and applied an ANOVA 
technique to the reliability of observational data. The 
model assumed that N teachers were visited m times each 
by a team of n observers. The assumption of linearity of 
variances was explicitly stated by the equation 




$$X_{ijk} = T_i + V_j + I_{ij} + e_{ijk}$$






in which all the variables actually represented deviations 
from their respective means. The variables in the 
equation were defined as follows: $X_{ijk}$ was the 
deviation from the mean of all values assigned to teacher 
i during visit j by observer k; $T_i$ was the deviation 
from the mean of all observations associated with teacher 
i; $V_j$ the deviation associated with visit j; $I_{ij}$ the 
deviation of the interaction between teachers and visits; 
and $e_{ijk}$ the deviation of the residual for teacher i on 
visit j in observation k. Based on these definitions, 
$I_{ij}$ was viewed as visit error for teacher i on visit j 
and $e_{ijk}$ as residual or observer error. Error was 
therefore considered to have two components. The first, 
visit error, was due to a lack of stability of teacher 
performance. The second, "observer" error, resulted from 
the discrepancy between two records of the same teacher 
performance made by two observers. 



The above equation permitted the taking of mathematical 
expectations and yielded 

$$\sigma_x^2 = \sigma_t^2 + \sigma_v^2 + \sigma_{tv}^2 + \sigma^2$$

where $\sigma_x^2$ was the total variance for all the observations 
x, $\sigma_t^2$ was the variance of the $T_i$, $\sigma_v^2$ of the $V_j$, 
$\sigma_{tv}^2$ of the $I_{ij}$, and $\sigma^2$ of the $e_{ijk}$. 

Based on the above, a reliability coefficient based on a 
single observation was given as 

$$R = \frac{\sigma_t^2}{\sigma_t^2 + \sigma_{tv}^2 + \sigma^2}$$

where the numerator on the right, $\sigma_t^2$, was the "true score" 
variance. This meant that the true score of interest was 
$T_i$, the mean of all performances of teacher i on all 
occasions j on which a visit was possible. The authors 
(Medley & Mitzel, 1958a) indicated that the $\sigma_v^2$ variance 
component was removed because they compared teachers who 
had been visited equally often. Since the scores were 
means over all visits, the visit effects cancelled out. 

A second reliability coefficient, R', was defined as 

$$R' = \frac{\sigma_t^2 + \sigma_{tv}^2}{\sigma_t^2 + \sigma_{tv}^2 + \sigma^2}$$

in which the true score was considered to be the performance 
of teacher i on visit j. This coefficient was actually 
equivalent to a coefficient of observer agreement, which 
usually is calculated as a Pearson r. Here, the 
variations in teacher performance were considered part 
of true score variance because they were observable by 
all observers present on a particular occasion. 

The reliability of the mean of a number of scores assigned 
to the same teacher, $R_{mn}$, was defined in terms of observer 
team size n and number of visits m 

$$R_{mn} = \frac{mn\sigma_t^2}{mn\sigma_t^2 + n\sigma_{tv}^2 + \sigma^2}$$

Estimates of R, R', and $R_{mn}$ were made from an ANOVA which 
was based on the assumptions that $T_i$, $V_j$, $I_{ij}$, and $e_{ijk}$ 
were normally and independently distributed in repeated 
random sampling with zero means and with variances of 
$\sigma_t^2$, $\sigma_v^2$, $\sigma_{tv}^2$, and $\sigma^2$, respectively. The ANOVA is 
reproduced in Table 1. 



Table 1 

Medley and Mitzel Reliability ANOVA 

                                       Mean Squares 
Source of Variation   d.f.          Observed    Expected 

Teachers              N-1           s_t^2       \sigma^2 + n\sigma_{tv}^2 + mn\sigma_t^2 
Visits                m-1           s_v^2       \sigma^2 + n\sigma_{tv}^2 + Nn\sigma_v^2 
Visit Error           (N-1)(m-1)    s_{tv}^2    \sigma^2 + n\sigma_{tv}^2 
Observer Error        Nm(n-1)       s^2         \sigma^2 
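
The way the three coefficients follow from the mean squares of Table 1 can be sketched as below. This is a minimal illustration, not the authors' procedure: the function name and the mean squares in the example are hypothetical, and setting negative component estimates to zero is an added convention that the source does not state.

```python
def medley_mitzel_coefficients(ms_teachers, ms_visit_error, ms_observer_error, m, n):
    """Estimate R, R' and R_mn by equating observed and expected mean squares.

    m is the number of visits per teacher, n the number of observers per team.
    """
    var_e = ms_observer_error                      # sigma^2
    var_tv = (ms_visit_error - var_e) / n          # sigma_tv^2
    var_t = (ms_teachers - ms_visit_error) / (m * n)  # sigma_t^2
    # Assumption: negative estimates are truncated at zero.
    var_tv = max(var_tv, 0.0)
    var_t = max(var_t, 0.0)
    single = var_t + var_tv + var_e
    return {
        "R": var_t / single,
        "R_prime": (var_t + var_tv) / single,
        "R_mn": m * n * var_t / (m * n * var_t + n * var_tv + var_e),
    }


# Hypothetical mean squares for two observers (n = 2) and eight visits (m = 8).
result = medley_mitzel_coefficients(50.0, 10.0, 2.0, m=8, n=2)
for name, value in result.items():
    print(name, round(value, 3))
```

With these illustrative figures R' is much larger than R, which is the pattern produced when observers agree closely on any one visit while teacher performance varies from visit to visit.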





The ANOVA technique was then applied to the Cornell and 
Withall techniques. In the first instance, six observers 
visited 33 teachers in teams of two such that each observer 
visited each teacher once. In the second, two observers 
visited four teachers eight times. 

The modified Cornell schedule contained the following 
eight scales: activity, variety, pupil climate, teacher 
climate, social organization, differentiation, pupil 
initiative, and content. Both the reliability coefficient, 
R, and the coefficient of observer agreement, R', were 
calculated for each scale and were: .41 and .63, .42 
and [values illegible in source] and .32, .37 and .66, 
.35 and .04, .00 and .43, and .00 and .23, respectively. The 
modified Withall categories were: learner-supportive, 
problem-structuring, neutral, directive, reproving, and 
climate index. The reliability and observer agreement 
coefficients were .25 and .90, .50 and .98, .00 and .50, 
.50 and .97, .00 and .88, and .47 and .96, respectively. 



In their later paper (Medley & Mitzel, 1963), in which the 
measurement of classroom behavior by systematic observation 
was discussed, the ANOVA technique for measuring the 
reliability of observations was further elaborated. This 
more complete analysis is given in Table 2 and assumed 
that scores were available for c classes on i items 
recorded by r observers on s visits or situations. A 
typical score was therefore indicated as $X_{cris}$. 



The adaptation of this ANOVA to a specific instance was 
dependent on three rules. The first was to substitute 
specific numerical values for literal ones, drop any 
line with zero degrees of freedom, and change the last 
remaining line to "residual." The second was to omit 
from the expected mean squares in the remaining lines 
all the components whose line had been dropped and also 
the component that corresponded with the new "residual." 
The third rule was to omit any component in any of the 
remaining lines that contained a subscript of a "fixed" 
variable. This rule applied to all but the first 
component on any line, which was never omitted. A "fixed" 
variable was a variable without an infinite number of 
values in the population. 



The calculation of the reliability of observational data 
was based on the standard definition 

$$\mathrm{Rho} = \sigma_T^2 / \sigma_X^2$$

Based on q recorders, j items, and t situations (referred 
to earlier as r, i, s, respectively) the variance of the 
population of the true scores was defined as 

$$\sigma_T^2 = (qjt)^2 \sigma_c^2$$

where $\sigma_T^2$ was the variance of the "true scores" and 
$\sigma_c^2$ was the first component shown in Table 2. The 
general expression for the variance of the actual scores 
of all the teachers in the population about their own mean, 
$\sigma_X^2$, was defined as 

$$\sigma_X^2 = qjt\,(qjt\sigma_c^2 + jt\sigma_r^2 + qt\sigma_i^2 + qj\sigma_s^2 + jt\sigma_{cr}^2 + qt\sigma_{ci}^2 + qj\sigma_{cs}^2 + t\sigma_{ri}^2 + j\sigma_{rs}^2 + q\sigma_{is}^2 + t\sigma_{cri}^2 + j\sigma_{crs}^2 + q\sigma_{cis}^2 + \sigma_{ris}^2 + \sigma^2)$$



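Table 2 

Medley and Mitzel Expanded Reliability ANOVA 

Source of                                Obtained       Expected 
Variation         d.f.                   Mean Square    Mean Square 

 1. Class (C)     c-1                    s_c^2          ris\sigma_c^2 + is\sigma_{cr}^2 + rs\sigma_{ci}^2 + ri\sigma_{cs}^2 + s\sigma_{cri}^2 + i\sigma_{crs}^2 + r\sigma_{cis}^2 + \sigma^2 
 2. Recorder (R)  r-1                    s_r^2          cis\sigma_r^2 + is\sigma_{cr}^2 + cs\sigma_{ri}^2 + ci\sigma_{rs}^2 + s\sigma_{cri}^2 + i\sigma_{crs}^2 + c\sigma_{ris}^2 + \sigma^2 
 3. Item (I)      i-1                    s_i^2          crs\sigma_i^2 + rs\sigma_{ci}^2 + cs\sigma_{ri}^2 + cr\sigma_{is}^2 + s\sigma_{cri}^2 + r\sigma_{cis}^2 + c\sigma_{ris}^2 + \sigma^2 
 4. Situation (S) s-1                    s_s^2          cri\sigma_s^2 + ri\sigma_{cs}^2 + ci\sigma_{rs}^2 + cr\sigma_{is}^2 + i\sigma_{crs}^2 + r\sigma_{cis}^2 + c\sigma_{ris}^2 + \sigma^2 
 5. C x R         (c-1)(r-1)             s_{cr}^2       is\sigma_{cr}^2 + s\sigma_{cri}^2 + i\sigma_{crs}^2 + \sigma^2 
 6. C x I         (c-1)(i-1)             s_{ci}^2       rs\sigma_{ci}^2 + s\sigma_{cri}^2 + r\sigma_{cis}^2 + \sigma^2 
 7. C x S         (c-1)(s-1)             s_{cs}^2       ri\sigma_{cs}^2 + i\sigma_{crs}^2 + r\sigma_{cis}^2 + \sigma^2 
 8. R x I         (r-1)(i-1)             s_{ri}^2       cs\sigma_{ri}^2 + s\sigma_{cri}^2 + c\sigma_{ris}^2 + \sigma^2 
 9. R x S         (r-1)(s-1)             s_{rs}^2       ci\sigma_{rs}^2 + i\sigma_{crs}^2 + c\sigma_{ris}^2 + \sigma^2 
10. I x S         (i-1)(s-1)             s_{is}^2       cr\sigma_{is}^2 + r\sigma_{cis}^2 + c\sigma_{ris}^2 + \sigma^2 
11. C x R x I     (c-1)(r-1)(i-1)        s_{cri}^2      s\sigma_{cri}^2 + \sigma^2 
12. C x R x S     (c-1)(r-1)(s-1)        s_{crs}^2      i\sigma_{crs}^2 + \sigma^2 
13. C x I x S     (c-1)(i-1)(s-1)        s_{cis}^2      r\sigma_{cis}^2 + \sigma^2 
14. R x I x S     (r-1)(i-1)(s-1)        s_{ris}^2      c\sigma_{ris}^2 + \sigma^2 
15. Residual      (c-1)(r-1)(i-1)(s-1)   s^2            \sigma^2 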







The rule for adapting this expression to a particular 
instance was to drop any component whose subscripts 
remained constant in all obtained scores. For example, 
if the same items were used in all classes, $qt\sigma_i^2$ would be 
dropped from the equation defining $\sigma_X^2$. 

Once $\sigma_T^2$ and $\sigma_X^2$ had been defined, linear equations were 
obtained by setting the actual mean squares, which were 
unbiased estimates of the expected mean squares, equal 
to their respective expected mean squares. The set of 
linear equations thus obtained was solved and yielded 
values which estimated the parameter values from the 
sample values. The results were as given in Table 3. 
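
The solution of these linear equations can be sketched generally. The code below is an illustration under stated assumptions: all four factors are treated as random, effects are keyed by sets of the letters c, r, i, and s, the numbers of recorders, items, and situations entering a score (q, j, t) are taken to equal the design levels r, i, s, and all function names and numerical values are hypothetical. The alternating-sign sum reproduces the pattern of the Table 3 estimates, e.g., class mean square minus the two-way mean squares containing c, plus the three-way ones, minus the residual.

```python
from itertools import combinations
from math import prod


def all_effects(factors):
    """Every main effect and interaction, as frozensets of factor letters."""
    names = sorted(factors)
    return [frozenset(c) for k in range(1, len(names) + 1)
            for c in combinations(names, k)]


def expected_mean_squares(components, levels):
    """Expected mean square of each effect when every factor is random."""
    effects = all_effects(levels)
    return {a: sum(prod(levels[f] for f in levels if f not in b) * components[b]
                   for b in effects if a <= b)
            for a in effects}


def variance_components(mean_squares, levels):
    """Solve 'observed = expected' for each component (alternating-sign sums)."""
    effects = all_effects(levels)
    return {a: sum((-1) ** (len(b) - len(a)) * mean_squares[b]
                   for b in effects if a <= b)
               / prod(levels[f] for f in levels if f not in a)
            for a in effects}


def reliability(components, levels, dropped=()):
    """Rho = true-score variance / obtained-score variance of class totals."""
    others = [f for f in levels if f != "c"]
    n_total = prod(levels[f] for f in others)          # q * j * t
    true_var = n_total ** 2 * components[frozenset("c")]
    skip = {frozenset(d) for d in dropped}             # constant components
    obtained_var = n_total * sum(
        prod(levels[f] for f in others if f not in b) * value
        for b, value in components.items() if b not in skip)
    return true_var / obtained_var


# Hypothetical design: 24 classes, 2 recorders, 3 items, 5 situations.
levels = {"c": 24, "r": 2, "i": 3, "s": 5}
true_components = {effect: 0.5 for effect in all_effects(levels)}
true_components[frozenset("c")] = 4.0
true_components[frozenset("cris")] = 1.0   # residual

mean_squares = expected_mean_squares(true_components, levels)
estimates = variance_components(mean_squares, levels)
# Assumption: same recorders, items and situations were used for every class.
rho = reliability(estimates, levels,
                  dropped=("r", "i", "s", "ri", "rs", "is", "ris"))
print(round(rho, 3))
```

Feeding the expected mean squares back through the estimator recovers the components that generated them, which is a convenient check on the signs and divisors.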



The model was then applied to data available on "pupil 
interest" scores from OScAR 3P. The data were collected 
by two observers in five situations in 24 classes. The 
given application first considered items and situations 
finite and recorders infinite, and then items finite and 
situations and recorders infinite. 



It was pointed out that the proposed reliability calculation 
did not require the assumption of normality of the 
distribution. This was so because the expected mean 
squares, upon which the reliability calculation depended, 
did not require that one assume a normal distribution. 

However, often one wished to test hypotheses regarding 
the value of the components for which the assumption of 
normality was required so that F tests could be made. 



Denny (1968) reported on the reliability and validity of 
the Denny, Rusch, Ives Classroom Creativity Observation 
Schedule. The schedule was constructed to identify those 
teacher-pupil variables which were related to pupil gain 
on creativity measures and contained three dimensions: 
climate, general structuring, and specific structuring. 
There were 11 items comprising the schedule. These were: 
motivational climate, pupil interest, teacher-pupil 
relationship, and pupil-pupil relationship (climate); pupil 
initiative, teacher approach, adaptation to individual 
differences, and variation in materials and activities 
(general structuring); and encouragement of pupil divergent 
thinking, encouragement of unusual pupil responses, and 
uniqueness (specific structuring). 



Thirty sixth-grade classes in a Midwestern state were 
visited three times by trained observer teams of three 
recorders. The observations were made between pre- and 
post-tests on adaptations of Guilford's tests. 

Table 3 

Medley and Mitzel Estimation of Variance Components 

 1. \sigma_c^2      (=)^a  (1/ris)(s_c^2 - s_{cr}^2 - s_{ci}^2 - s_{cs}^2 + s_{cri}^2 + s_{cis}^2 + s_{crs}^2 - s^2) 
 2. \sigma_r^2      (=)    (1/cis)(s_r^2 - s_{cr}^2 - s_{ri}^2 - s_{rs}^2 + s_{cri}^2 + s_{ris}^2 + s_{crs}^2 - s^2) 
 3. \sigma_i^2      (=)    (1/crs)(s_i^2 - s_{ci}^2 - s_{ri}^2 - s_{is}^2 + s_{cri}^2 + s_{cis}^2 + s_{ris}^2 - s^2) 
 4. \sigma_s^2      (=)    (1/cri)(s_s^2 - s_{cs}^2 - s_{rs}^2 - s_{is}^2 + s_{crs}^2 + s_{cis}^2 + s_{ris}^2 - s^2) 
 5. \sigma_{cr}^2   (=)    (1/is)(s_{cr}^2 - s_{cri}^2 - s_{crs}^2 + s^2) 
 6. \sigma_{ci}^2   (=)    (1/rs)(s_{ci}^2 - s_{cri}^2 - s_{cis}^2 + s^2) 
 7. \sigma_{cs}^2   (=)    (1/ri)(s_{cs}^2 - s_{crs}^2 - s_{cis}^2 + s^2) 
 8. \sigma_{ri}^2   (=)    (1/cs)(s_{ri}^2 - s_{cri}^2 - s_{ris}^2 + s^2) 
 9. \sigma_{rs}^2   (=)    (1/ci)(s_{rs}^2 - s_{crs}^2 - s_{ris}^2 + s^2) 
10. \sigma_{is}^2   (=)    (1/cr)(s_{is}^2 - s_{cis}^2 - s_{ris}^2 + s^2) 
11. \sigma_{cri}^2  (=)    (1/s)(s_{cri}^2 - s^2) 
12. \sigma_{crs}^2  (=)    (1/i)(s_{crs}^2 - s^2) 
13. \sigma_{cis}^2  (=)    (1/r)(s_{cis}^2 - s^2) 
14. \sigma_{ris}^2  (=)    (1/c)(s_{ris}^2 - s^2) 
15. \sigma^2        (=)    s^2 

a The symbol (=) is to be read "is estimated by." 

Both of the Medley and Mitzel (1958a, 1963) techniques 
were used: the latter model to estimate the total 
reliability of .42, and the former to calculate R, R', 
and $R_{mn}$ for each of the 11 items comprising the 
schedule. The values of R ranged from .15 to .72, of 
R' from .40 to 1.00, and of $R_{mn}$ from .38 to .91. The 
author (Denny, 1968) concluded that the reliability 
estimates were at least as good as those obtained for 
similar schedules but that the validity estimates 
indicated a need for further analysis of the dimensions 
and items of the schedule. 



Medley (1967) described a new way to score the OScAR 4V 
so that more meaningful information could be obtained 
from the raw scores. The method depended on an ANOVA 
technique which permitted the partitioning of variance 
through the use of orthogonal contrasts. The data were 
collected by an observer who visited 70 teachers four 
times for about 20 minutes per time. These scores were 
correlated with scores collected a week or two later on 
four more visits. The correlations were estimated for 
each scale by the 1958 ANOVA technique for the four 
"entry" and six "exit" categories. The entry categories 
were: pupil initiative, cohesion, divergence, and a total 
score. The exit categories were: feedback, valence, 
enthusiasm, positivity, encouragement, and a total score. 
The intercorrelation between the total scores was .73 
and the range for the other intercorrelations was 0 to .76. 


The rationale for the use of orthogonal contrasts to 
develop scoring keys was that the transformed scores 
remained linearly independent functions of the original 
raw scores while at the same time the contrasts yielded 
information of specific interest to the investigator. 

This was so because the contrasts could be chosen in a 
very large number of ways. It was therefore incumbent 
on the investigator to choose the set which best answered 
whatever questions interested him most. 

Brown et al. (1968) calculated "between observer," "within 
observer," and internal consistency reliability as well as 
the correlations mentioned earlier in the preceding section 
of this chapter (see p. 10). The authors (Brown et al., 
1968) pointed out that since they were using "untrained" 
observers the Medley and Mitzel (1963) ANOVA technique 
was not suitable because the ANOVA gave a reliability 
measure of "between observer" variability while what was 
needed for their data was a "within observer" reliability 
coefficient. Accordingly, a within observer reliability 
coefficient for the repeated viewing of the films was 
derived. The within observer reliabilities ranged from 
.48 to .62. 






The derivation was based on the rationale that if the same 
observer scored the same teaching situation twice in the 
same way, then the judge's scoring was reliable. To 
accomplish this end, the ratio of two different values for 
the variance of the difference between the first and 
second viewings was derived and constituted the 
reliability coefficient. 

The difference, $d_i$, was defined as 

$$d_i = x_{1i} - x_{2i}$$

where 1 and 2 refer to the viewing and the i refers to the 
items. 

Then, for independent scores 

$$V(d_i) = V(x_{1i} - x_{2i}) = V(x_{1i}) + V(x_{2i}) = \sigma^2 + \sigma^2 = 2\sigma^2$$

However, for correlated scores 

$$V(d_i) = V(x_{1i} - x_{2i}) = V(x_{1i}) + V(x_{2i}) - 2\,\mathrm{Cov}(x_{1i}, x_{2i}) = 2\sigma^2 - 2\sigma_{12} = \sigma_d^2$$

These two formulas were based on the assumptions that 

(1) $V(x_{ji}) = \sigma^2$ for j = 1, 2 and i = 1, ..., n 

and that 

(2) $p(x) = 1/k$ 

where the probability of selecting a particular item, p(x), 
was equal for all the possible choices, k. The reliability 
coefficient was defined as 

$$\mathrm{Rho} = 1 - \frac{\sigma_d^2}{2\sigma^2}$$

The value of $\sigma^2$ was treated as a constant because of the 
assumption of random choice of each item x on the part of 
the judge and was calculated as 

$$\sigma^2 = \sum_x (x - \mu)^2 \, p(x)$$

The actual reliability calculation was then given as 

$$r = 1 - \frac{s_d^2}{2\sigma^2}$$

where $s_d^2$, the sample value, estimated $\sigma_d^2$. 



Rho and its statistic, r, were equal to the difference 
between perfect correlation and the ratio of the difference 
between the variance for independent scores and twice the 
covariance of dependent scores to the variance of 
independent scores. As a result, for independent scores 
Rho becomes equal to zero. This was exactly the formulation 
that Brown et al. (1968) wanted because they were interested 
in a measure of agreement within the observers. The 
greater the agreement, the greater the value of Rho and r. 
Mathematically, this can be seen easily because it can be 
shown that 

$$\mathrm{Rho} = 1 - \frac{2\sigma^2 - 2\sigma_{12}}{2\sigma^2} = \frac{\sigma_{12}}{\sigma^2}$$

and therefore Rho is directly proportional to the covariance. 
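
A minimal sketch of the within observer coefficient follows. Several details are assumptions made only for illustration: the function name, the five-point scale, the two sets of item scores, and the choice to compute the sample variance of the differences about their mean with n - 1 in the denominator, which the source does not specify.

```python
def within_observer_reliability(first_viewing, second_viewing, scale_values):
    """r = 1 - s_d^2 / (2 * sigma^2) for one observer's two viewings."""
    if len(first_viewing) != len(second_viewing) or len(first_viewing) < 2:
        raise ValueError("need two equally long sets of at least two item scores")
    # sigma^2: variance of a score chosen at random, each scale value having
    # probability 1/k.
    k = len(scale_values)
    mu = sum(scale_values) / k
    sigma2 = sum((x - mu) ** 2 for x in scale_values) / k
    # s_d^2: sample variance of the item-by-item differences (assumed n - 1).
    diffs = [a - b for a, b in zip(first_viewing, second_viewing)]
    mean_d = sum(diffs) / len(diffs)
    s_d2 = sum((d - mean_d) ** 2 for d in diffs) / (len(diffs) - 1)
    return 1 - s_d2 / (2 * sigma2)


# Hypothetical item scores from one judge on a five-point scale.
scale = [1, 2, 3, 4, 5]
first = [1, 3, 5, 2, 4]
second = [2, 3, 4, 2, 5]
print(round(within_observer_reliability(first, second, scale), 3))
```

Identical scores on the two viewings give a coefficient of 1, and the coefficient falls as the differences between viewings become more variable.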



Item reliability was calculated through Kuder-Richardson 
formulas and ranged from .77 to .81. "Between observer" 
reliability was reported as fair. 

Seibel (1967) investigated whether it was possible to 
predict the classroom behavior of teachers. The Ss were 
100 graduate students with liberal arts backgrounds who 
were enrolled in the Harvard Graduate School of Education 
in 1954. The Ss were rated by the classroom teacher in 
whose room they had their teaching practicum and by their 
university supervisor on eight criteria of teacher 
behavior. The criteria were: rewards, support, contact, 
movement, service, compliance, suggestions, and humor. 

The ratings of the Ss were adjusted for "reliability" 
by asking each rater to indicate the "amount of confidence" 
he had in each of his ratings. Confidence in ratings was 
indicated on a seven-point scale from "complete confidence 
that rating is accurate" (7) to "no confidence whatever, 
just a guess" (1). The estimates of confidence were 
treated as estimates of reliability according to the 
following scale: 

Confidence Rating      Reliability Estimate 

        7                     1.00 
        6                     0.83 
        5                     0.67 
        4                     0.50 
        3                     0.33 
        2                     0.17 
        1                     0.00 



The reliability estimates were then used to adjust the 
behavior ratings according to the formula: 

$$x' = rx + (\bar{x} - r\bar{x})$$

where x' = the estimated true rating 
      r  = the estimated reliability of the obtained rating 
      x  = the obtained rating 
      $\bar{x}$ = the mean of the obtained ratings for the group 
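
The adjustment can be illustrated with a short sketch. The lookup table repeats the scale given above; the function name, the rating, and the group mean are hypothetical values chosen only to show how the formula behaves.

```python
# Reliability estimate assigned to each confidence rating (from the scale above).
RELIABILITY_BY_CONFIDENCE = {7: 1.00, 6: 0.83, 5: 0.67, 4: 0.50,
                             3: 0.33, 2: 0.17, 1: 0.00}


def adjusted_rating(rating, confidence, group_mean):
    """Estimated true rating: x' = r*x + (mean - r*mean)."""
    if confidence not in RELIABILITY_BY_CONFIDENCE:
        raise ValueError("confidence must be an integer from 1 to 7")
    r = RELIABILITY_BY_CONFIDENCE[confidence]
    return r * rating + (group_mean - r * group_mean)


# Hypothetical obtained rating of 6 in a group whose mean rating is 4.
for confidence in (7, 4, 1):
    print(confidence, adjusted_rating(6, confidence, group_mean=4.0))
```

A rating given with complete confidence is left unchanged, while a rating given with no confidence is replaced by the group mean; intermediate confidence pulls the rating part of the way toward the mean.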

The behavior rating scores were then correlated with 
12 variables, which were: Miller Analogies 
Test, Minnesota Teacher Attitude Inventory score (MTAI), 
F-scale, Minnesota Multiphasic Personality Inventory 
Paranoia, Psychasthenia, and Social Introversion- 
Extroversion Scales, Wickman Schedule "no consequence" 
and "extremely grave consequence" Pupil Misbehaviors, 
Previous Teaching-Leadership Activities with children, 
Practice Teaching Grade, change in MTAI, and change in 
F-scale. 



Zero-order, multiple, and canonical correlations were found 
and led Seibel (1967) to conclude that ". . . to a degree 
it may be possible to predict how a teacher will behave 
in the classroom." 



Summary of Literature on Recent Methods of Calculating 
Reliability of Observational Data 



The papers reviewed in this section were different from 
those in the review of traditional methods in that they 
were more involved with the problem of the reliability of 
observational data than the studies reviewed in the 
previous section. The most heuristic and technically 
advanced method of calculating reliabilities was that of 
Medley and Mitzel (1958a, 1963). 



Of the other work presented, one study developed a 
reliability coefficient which was actually a percentage of 
agreement (Scott, 1955) while another developed a reliability 
estimate based on confidence ratings (Seibel, 1967). Two 
other studies adapted or used the Scott coefficient 
(Flanders, 1960; Furst & Amidon, 1967), while two studies 
used the ANOVA techniques of Medley and Mitzel (Denny, 
1968; Brown et al., 1968). Of these latter two studies, 
Denny actually used both ANOVA models without any 
adaptation or change. Brown et al. also used the ANOVA 
technique but at the same time developed their own 
"within-observer" reliability coefficient. 






Summary of Related Literature 

The research in this chapter was categorized under two 
headings which dealt with traditional and other methods 
of estimating reliabilities of observational data. The 
usual method of calculating reliabilities was found to 
be the Pearsonian correlation or the percentage of 
agreement or their equivalents. Only two studies other 
than those of Medley and Mitzel used an ANOVA technique. 

The difficulty in applying the factorial model, besides 
its theoretical complexity, was due to its administrative 
difficulties. These difficulties resulted from the 
requirement that the same observers visit the same 
teachers more than once. The present investigation 
sought to develop procedures which would permit 
estimates under the more typical field situations. 


Chapter II 

The Subjects, Materials, and Procedures 

The purpose of this investigation was to study the 
relationship of the variables which were present during 
observations that were carried out by members of a team. 
This information was to be applied so that the reliability 
of this type of data could be estimated through an ANOVA 
technique or techniques under different conditions. The 
model was then to be applied to the SUTEC Observation 
Schedule. 

The aims of this section were: (1) to describe the 
subjects of an observational study in general, and the 
subjects who participated in the SUTEC study in particular; 
(2) to describe the materials; (3) to indicate the 
procedures which were followed; and (4) to present the 
statistical bases for the analyses of the data. 

The Subjects 

A study dealing with observational data usually has two 
different sets of subjects, those being observed and 
those doing the observing. At different stages of the 
investigation, one is interested in first one of these 
sets of subjects and then the other. For the purposes 
of this study, those being observed were the teachers 
and those doing the observing were the people who 
constituted the observer team. 

At the beginning of an observational study the major 
problems are those which pertain to the observers and 
their ability to see and report accurately that which 
they have been instructed to observe. This investigation 
addressed itself to this question of the reliability of 
observations and therefore the subjects under 
consideration were the observers. 



The training and employment of a team of observers is 
usually expensive. For this reason, observer teams are 
generally restricted to 10 or fewer members. The 
number of teachers visited by the team, when reliability 
is to be established, is also generally less than 10. 

The maximum number of teachers actually visited, as 
found in the review of the literature in the previous 
chapter of this paper, was six. 

There were 10 people who were trained and acted as 
observers for the SUTEC project. All of the observers 
were graduate students in education, related areas, or 
their equivalent. The five teachers who were observed 
by the team, for the reliability study, were all 
regularly licensed New York City teachers who had a 






minimum of three years of teaching experience. 



The Materials 

The general model presented in this investigation is 
applicable to many different types of observation 
schedules. The types of materials involved fall under 
the generic definition given in the first chapter of 
this report. For examples of schedules to which the 
model is applicable, the reader is referred to Chapter I. 

In terms of the SUTEC data, the observer team observed 
only the following seven categories of behavior: teacher 
mobility, involvement of children, materials present, 
materials in use, directed behavior, spontaneous behavior, 
and irrelevant acts. These items are briefly described 
below. The underlying rationale of the schedule and more 
detailed descriptions were given by Chapline (1968). A 
copy of the schedule is attached (Appendix). 

Teacher Mobility. The number of different positions
occupied by the teacher during the second five minutes
of each learning activity, indicated on a room sketch.

Involvement of Children. A global judgment of the
attentiveness of the whole class during each learning
activity, assessed on a three point scale from
uninvolved (1) to highly involved (3).

Materials Present. The number of different materials
present during the entire observation, checked on a list
of materials.

Materials in Use. The number of different materials in
use during the entire observation, checked on a list
of materials.

Directed Behavior. The number of times during each
activity that the teacher called on pupils without the
pupils first indicating a willingness to respond.

Spontaneous Behavior. The number of times that the
pupils indicated a willingness to respond before being
asked to do so, plus the number of times that the pupils
responded spontaneously before permission was granted.
The two counts were weighted in a ratio of 1:2,
respectively, before being added. Raising a hand
would be scored as a one while calling out the
answer would be scored as a two. If both occurred
during the same activity, the activity would be scored as
a three provided nothing else happened for the duration
of the activity.
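The 1:2 weighting of the Spontaneous Behavior category can be sketched as a small function. This is only an illustration of the scoring rule described above; the function name and argument names are hypothetical and do not come from the SUTEC schedule.

```python
def spontaneous_score(hand_raises: int, call_outs: int) -> int:
    """Weighted Spontaneous Behavior score for one activity.

    hand_raises: times pupils indicated a willingness to respond (weight 1)
    call_outs:   times pupils responded before permission was granted (weight 2)
    Both argument names are illustrative assumptions.
    """
    if hand_raises < 0 or call_outs < 0:
        raise ValueError("counts must be non-negative")
    return 1 * hand_raises + 2 * call_outs


# One raised hand and one called-out answer in the same activity.
example = spontaneous_score(hand_raises=1, call_outs=1)
```

With one occurrence of each behavior the function returns three, which matches the worked case in the category description.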



Irrelevant Acts. The number of [...] movements obviously
not related to the learning activity of twelve randomly
selected children.

Svolvemen^of S6Ven cate S°ries. 

while the other six cate^it h f traa one to three » 
into the schedule and we?e 

The Procedures

[...] was the estimation of the reliability of three of the
categories that were considered for inclusion in the final
form of the SUTEC schedule. These items were [...],
involvement, and irrelevant acts. For this [...] reliability
estimate, seven observers visited two [...] and rated them on
three categories. [...] observers who made the second visit
were also present during the first visit.

[...] reliability of the entire schedule. Three other teachers
were visited by an observer team of seven [...]. Although careful
planning had preceded the visits to insure that all 10
members of the observer team would be present, such
was not the case. Here too, the seven observers present
were not the same in each case.

[...] were each given a copy of the observation schedule
[...] they were to use and the categories were [...] and
explained to their satisfaction. This discussion was followed
by a field test which, in turn, was followed by a comparison
and discussion of the obtained results. Upon repetition
of this procedure [...] felt confident in their ability to use
the schedule properly. Visits to the different teachers by the
[...] group were arranged to determine the reliability of the
observer team. All the observations were conducted through a
one way mirror with the teacher's knowledge and consent.

[...] of training the SUTEC team was consonant with generally
accepted practice.

The Statistical Procedures

[...] of the problem. The areas discussed were: some of the
[...], the meaning of "crossed" and "nested" factors, the
relationship of ANOVA to reliability estimation, the variables
and some of the conditions under which different designs are
possible, the calculation of expected mean squares, and
some of the general models under the various conditions.



Difficulties with Traditional Reliability Estimates. The
shortcomings of the product moment coefficient of
correlation and the percent of agreement between observers
as measures of reliability of observational data were
originally pointed out by Medley and Mitzel (1958a, 1963)
and paraphrased by Brown et al. (1968). For one thing,
the sampling distribution of r is dependent on N, the
number of scores on each item, and it is difficult to
have large numbers of people view the same classroom
on two different occasions or to control variations
between the two visits. Furthermore, the number of
classrooms visited by two different observers at two
different times is likely to be small. In either case
an N as great as 100 in dealing with observational
studies is extremely rare. With N = 100 the confidence
interval for the correlation coefficient may be as wide
as .33 (Medley & Mitzel, 1963) and therefore the
correlation coefficient is not very precise. At the same
time most such correlations are usually based on total
scores which do not take into account variations in
scoring individual items or categories.
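The dependence of the precision of r on N can be illustrated with the standard Fisher z approximation. This is a minimal sketch under the assumption of a 95 percent interval and bivariate normal scores; it is not necessarily the procedure Medley and Mitzel used, so it is not expected to reproduce their .33 figure exactly. The function name and default critical value are choices made here for illustration.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.959964) -> tuple[float, float]:
    """Approximate confidence interval for a correlation via Fisher's z.

    r: observed product moment correlation (-1 < r < 1)
    n: number of paired scores (n > 3)
    z_crit: normal critical value; 1.959964 gives roughly a 95% interval
    """
    z = math.atanh(r)                  # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)        # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Interval widths for the same r at two sample sizes.
low_100, high_100 = fisher_ci(0.5, 100)
low_25, high_25 = fisher_ci(0.5, 25)
width_100 = high_100 - low_100
width_25 = high_25 - low_25
```

Even at N = 100 the interval for a moderate correlation is roughly three tenths wide, and it widens considerably for the smaller samples typical of observational studies.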



Percentage of agreement between observers may give very
little information about the reliability of the scores
obtained. This is possible if the observed teaching
practice occurs in each room, for then the reliability
of that item as a differentiator of teachers will be
zero. It is equally possible that near perfect agreement
be reached about the number of times that a teacher
employed a certain category of behavior; if the teacher
sharply reversed these behaviors from observation to
observation, the reliability of these categories from
visit to visit would be zero.
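The first point can be made concrete with a small example. The scores below are invented for illustration only: two observers agree perfectly on five teachers, yet every teacher receives the same score, so the item carries no information for telling teachers apart.

```python
def percent_agreement(scores_a, scores_b):
    """Percentage of teachers on which two observers gave identical scores."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)


def variance(values):
    """Population variance, computed directly."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical ratings of five teachers by two observers on one item.
observer_a = [3, 3, 3, 3, 3]
observer_b = [3, 3, 3, 3, 3]

agreement = percent_agreement(observer_a, observer_b)
teacher_means = [(a + b) / 2 for a, b in zip(observer_a, observer_b)]
between_teacher_variance = variance(teacher_means)
```

Agreement is 100 percent while the variance between teachers is zero, so any ratio of true score variance to total variance would have a zero numerator.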



The shortcomings mentioned above led Medley and Mitzel
(1963) to develop their single intraclass correlation
coefficient. The estimate of rho so obtained was more
precise than any combination of interclass correlations
because such a combination of correlation coefficients
was not made up of independent measures. The Medley and
Mitzel (1963) model permitted the calculation of the
variance attributable to each of the independent factors
operating during the course of the observations. At the
same time, the different reliability coefficients
appropriate to the uses to which the scores might be put
could all be estimated from the one analysis of variance.

Crossed and Nested Factors. Factors are said to be
crossed if each level of each factor appears at least
once with every level of every other factor. Factors are
said to be nested if each level of each factor appears
in only one level of the other factors (Millman & Glass,
1967). A factor which is not nested in any other factor
may therefore be considered crossed (Peng, 1967).

Relationship of ANOVA to Reliability Estimation. The
definition of reliability given by Medley and Mitzel
(1963) was comparable to that discussed by Winer (1962)
for a more simplified type of design. However, because
of the comparability of the concepts, parts of the
argument will be reproduced and some of the algebra that
was deleted will be filled in. The basic definition, that
the reliability of k measurements is the ratio of the
variance of the true scores to the sum of the true score
variance and the variance due to errors of measurement,
was the same for both authors.

Winer (1962) indicated that the reliability for the mean
of k measurements may be estimated by

    r_k = \frac{(1/k)(MS_{\text{between people}} - MS_{\text{w. people}})}
               {(1/k)(MS_{\text{between people}} - MS_{\text{w. people}}) + (1/k) MS_{\text{w. people}}}

where the variance of the true score, \sigma_T^2, was estimated
by the numerator and the sum of the true score and error
of measurement variances, \sigma_X^2, was estimated by the
denominator of r_k. Multiplication of both numerator and
denominator of r_k by k yielded,

    r_k = \frac{MS_{\text{between people}} - MS_{\text{w. people}}}
               {(MS_{\text{between people}} - MS_{\text{w. people}}) + MS_{\text{w. people}}}

        = \frac{MS_{\text{between people}} - MS_{\text{w. people}}}{MS_{\text{between people}}}

        = \frac{MS_{\text{between people}}}{MS_{\text{between people}}}
          - \frac{MS_{\text{w. people}}}{MS_{\text{between people}}}

        = 1 - \frac{MS_{\text{w. people}}}{MS_{\text{between people}}}

The estimate of \sigma_T^2 was therefore seen to be
MS_{\text{between people}} - MS_{\text{w. people}}, the difference between
the sources of variation, while the estimate for \sigma_X^2 was
the sum of \sigma_T^2 and the error of measurement.

It was also shown that r_1, the reliability estimate of a
single measurement, was given by

    r_1 = \frac{(1/k)(MS_{\text{between people}} - MS_{\text{w. people}})}
               {(1/k)(MS_{\text{between people}} - MS_{\text{w. people}}) + MS_{\text{w. people}}}

Multiplying both numerator and denominator by k yielded

    r_1 = \frac{MS_{\text{between people}} - MS_{\text{w. people}}}
               {MS_{\text{between people}} + (k-1) MS_{\text{w. people}}}

For the sake of algebraic brevity the following
substitutions were made.

    X = (1/k)(MS_{\text{between people}} - MS_{\text{w. people}})

    Y = MS_{\text{w. people}}

Upon substitution into the equation which defined r_1,

                       r_1 = X / (X + Y)
    multiplying,       r_1 (X + Y) = X,
    dividing by r_1,   X + Y = X / r_1,
    transposing,       Y = X / r_1 - X,
    combining,         Y = (X - r_1 X) / r_1.

Substituting back into the original equation which
defined r_k,

    r_k = \frac{X}{X + (1/k)(X - r_1 X) / r_1}

multiplying numerator and denominator by k r_1,

    r_k = k r_1 X / [k r_1 X + (X - r_1 X)]

factoring,

    r_k = k r_1 X / [X (k r_1 + 1 - r_1)]
    r_k = k r_1 / [1 + r_1 (k - 1)]

which is the well known Spearman-Brown prediction
formula. A somewhat different treatment yielding
the same result was given by McNemar (1962).
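The algebra above can be checked numerically. The sketch below, with hypothetical mean squares chosen only for illustration, computes r_k and r_1 from the two mean squares and confirms that stepping r_1 up with the Spearman-Brown formula reproduces r_k. The function names are assumptions made here, not Winer's notation.

```python
def reliability_of_mean(ms_between: float, ms_within: float) -> float:
    """r_k = 1 - MS_within / MS_between, the reliability of the mean of k measurements."""
    return 1.0 - ms_within / ms_between


def reliability_of_single(ms_between: float, ms_within: float, k: int) -> float:
    """r_1 = (MS_b - MS_w) / (MS_b + (k - 1) MS_w), the reliability of one measurement."""
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def spearman_brown(r1: float, k: int) -> float:
    """Reliability of a mean of k measurements predicted from single-measurement r_1."""
    return k * r1 / (1.0 + (k - 1) * r1)


# Hypothetical mean squares for k = 4 measurements per person.
MS_BETWEEN, MS_WITHIN, K = 20.0, 4.0, 4

r_k = reliability_of_mean(MS_BETWEEN, MS_WITHIN)
r_1 = reliability_of_single(MS_BETWEEN, MS_WITHIN, K)
r_k_from_r_1 = spearman_brown(r_1, K)
```

For these values r_1 is 0.5 and both routes to r_k give 0.8, as the derivation requires.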

The assumptions on which these formulas and calculations
rest are that MS_{\text{w. people}} may be pooled to [...] estimate
of the error of measurement, that the error of estimate is
uncorrelated with the true score, and that the sample of n
people and k measuring instruments are random samples to and
from which [...] are to be made, respectively. The more
involved cases when these assumptions were not met need
[...] here because [...] essential relationship [...] been
indicated and the calculation of \sigma_T^2 and \sigma_X^2 [...]
estimated by following the "rules of thumb" given by Medley
and Mitzel (1963).

The Variables and Designs

Before considering specific designs, some basic notions
about the composition of a specific score and its
relationship to population parameters and "error" will
[...]. The discussion will be presented in terms
of a simplified model and alluded to once again
in Chapter III of this report.

The factors which may be expected to produce variation
among observational scores of teacher behaviors are
differences among teachers and differences among visits.
For convenience, T_i will be used to represent the
difference between the mean for teacher i and the mean
of all the observations, and V_j will represent the
difference between the mean of visit j and the mean of
all visits. In essence then, T_i and V_j represent
the deviations associated with teacher i and visit j.
It is assumed that T_i and V_j will be the same for
every visit and for all teachers on the jth
visit to each of them, respectively.



Human behavior being what it is, it is probable that
[...] behave differently on the first
visit than on the others, while other teachers'
[...] may change slightly from visit to visit so
that there [...] a great change in the behavior of some
of the [...] from the initial to the final visit. In
statistical terms, one would say that there is an
interaction between visits and teachers. If P_{ij} denotes
the performance of teacher i on visit j and I_{ij} the
interaction term, then

    P_{ij} = \mu + T_i + V_j + I_{ij}

where \mu denotes the grand mean of all teachers on all
visits. The score, X_{ijk}, assigned to teacher i by
observer k for visit j may or may not be identical to
the performance P_{ij} of that teacher on that visit.
The error is defined as the difference between
the assigned score and the actual performance,

    e_{ijk} = X_{ijk} - P_{ij}

Substituting for P_{ij},

    e_{ijk} = X_{ijk} - (\mu + T_i + V_j + I_{ij})

Transposing,

    X_{ijk} = \mu + T_i + V_j + I_{ij} + e_{ijk}

This simplified model follows closely that presented by
Medley and Mitzel (1958a) and in principle is easily
[...] to include other factors. This linear
model leads to

    \sigma_X^2 = \sigma_T^2 + \sigma_V^2 + \sigma_{TV}^2 + \sigma_e^2

where \sigma_X^2 is the total variance for all observations X,
\sigma_T^2 is the variance of the T_i, \sigma_V^2 of the V_j,
\sigma_{TV}^2 of the I_{ij}, and \sigma_e^2 of the e_{ijk}.
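The additive model can be illustrated by simulation. The sketch below makes several assumptions that are not in the report: the four effects are drawn as independent normal deviates for every score, the standard deviations and grand mean are arbitrary, and a fixed random seed is used. Under independence the variance of the simulated scores should be close to the sum of the component variances.

```python
import random


def simulate_scores(n_obs, sd_t, sd_v, sd_i, sd_e, mu=10.0, seed=0):
    """Draw scores X = mu + T + V + I + e with independent normal effects.

    Each score gets its own draw of every effect; this is a simplification
    of the crossed teacher-by-visit layout, used only to show additivity.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_obs):
        t = rng.gauss(0.0, sd_t)   # teacher deviation T
        v = rng.gauss(0.0, sd_v)   # visit deviation V
        i = rng.gauss(0.0, sd_i)   # teacher-by-visit interaction I
        e = rng.gauss(0.0, sd_e)   # error of measurement e
        scores.append(mu + t + v + i + e)
    return scores


def variance(values):
    """Population variance, computed directly."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


scores = simulate_scores(50_000, sd_t=2.0, sd_v=1.0, sd_i=0.5, sd_e=1.0)
total_variance = variance(scores)
sum_of_components = 2.0**2 + 1.0**2 + 0.5**2 + 1.0**2   # 6.25
```

The simulated total variance differs from 6.25 only by sampling fluctuation, which is the sense in which the linear model partitions the variance of the observations.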



The variables which were considered essential for
[...] observations were those used by Medley and
Mitzel (1963). These were: teachers or classes,
[...], items, and situations. Under certain
[...] one or more of the variables may be deleted
[...] the analysis. For example, if all the teachers
[...] on only one item by the team of observers,
or if only one observer did all the observations, or if
[...] visited only once, then items, observers,
and situations would have to be dropped from the
analysis associated with their respective cases.
[...] each of the above contingencies it is possible
[...] have more than one condition operating concurrently.



An example that will be considered in some detail in the
next chapter is the case of the partially nested or
partially hierarchical design. If each teacher were
visited only once by a team of observers which had the
same number of people on the team but not the same team,
the team factor would be nested under the teacher
factor. To consider a simple case, suppose that two
teachers were visited by three observers and rated on
four items. Each teacher essentially has a team
peculiar to himself because the people comprising the
team for teacher one are not all in the team for teacher
two, etc. A schematic representation of this situation
is given in Table 4.



Table 4

Schematic Representation of a Three Factor
Partially Nested Design

                  Teacher 1                    Teacher 2
          Obs. 1   Obs. 2   Obs. 3     Obs. 4   Obs. 5   Obs. 6
Item 1
Item 2
Item 3
Item 4


The general model for this design has the structural form
(Winer, 1962):

    ABC_{ijk} = \mu + A_i + B_{j(i)} + C_k + AC_{ik} + BC_{j(i)k} + error

where teachers, observers, and items are factors A, B, and
C, respectively, and the terms on the right side of the
equation represent population parameters. In this design
it is possible that (1) each observation schedule item
has n subcategories and therefore each cell has n scores
or that (2) each item has no subcategories and yields only
one score per cell. In the former case the left hand
side of the equation and the error term represent the
mean of the measurements for the n elements under the
treatment combination ijk, where i, j, k are the number



of [...], respectively, and the error term [...] within the
respective cells. In the case of (2), there being no within
cell variation, the ABC_{ijk} and the error term refer to the
[...] errors obtained under treatment condition ijk rather
than to [...] cell values. This model will be [...] of this
investigation [...].

[...] alternate design for analyzing this type of situation,
definitions of "random," "finite," [...] factors will be given.
[...] "Teachers" is an example of a factor that is usually
considered random. If a random sample from a finite population
of levels constitutes the [...], the factor is considered finite.
When a systematic selection of levels, all levels [...] to the
investigator, are included in the study, the factor is
considered fixed. In each case [...] are generalizable to the
population from which the levels were drawn. In terms
of the variance components, a random factor must be
[...] throughout the analysis and is contained as
a component in each expected mean square term associated
with each [...] variation. Fixed factors on the
other hand are not carried throughout the analysis. The
[...] for determining the expression for the
expected [...] are elaborations and applications
of this concept [...] further elaborated in Chapter III.

An alternative approach to the above example in which
there was only one observation per cell is to treat the
design as a two factor repeated measure design.
Table 5 shows repetitions of the items three times, under O1,
O2, and O3 and again under O4, O5, and O6.

Table 5

Schematic Representation of Two Factor
Repeated Measures Design

                          Observers
               Teacher 1              Teacher 2
          O1      O2      O3      O4      O5      O6
Item 1
Item 2
Item 3
Item 4

The structural model for this design has the following
form (Winer, 1962):

    X_{ijk} = \mu + A_i + P_{k(i)} + B_j + AB_{ij} + BP_{jk(i)} + error

where teachers and items are factors A and B, respectively,
and P_{k(i)} and the error term are the effects of observer
k who is nested under a "teacher" level and the error
associated with the observer, respectively. Here too, the
variables on the right hand side of the equation represent
parameters.

Generally, an experiment in which the same elements are
exposed to n treatments requires n observations on each
element. Hence the term repeated measure. The purpose
of this type of experiment, especially in "learning"
studies, is to provide a control on differences between
subjects. This is accomplished because each subject 
essentially serves as his own control. To the degree that 
specific characteristics of the individual elements 
remain constant under different treatment conditions, 
observations on the same elements tend to be positively 
correlated or dependent. An alternative approach to a 
repeated measures design is to include a nested random 
factor, a dummy variable, in the model to absorb the 
correlation between the experimental errors. This 
approach was followed in Chapter III and was recently 
discussed as a possible way to adapt existing computer 
programs to correlated observations (Clifford, 1968). 

In a two factor design in which there are repeated 
measures on factor B the comparisons between the treatments 
at different levels of factor A involve differences between 
groups as well as differences associated with factor A. 
Under these conditions the main effects of factor A are 
said to be confounded with differences between groups 
whereas the main effects of factor B and the AB interaction 
term are free of such confounding. Tests on factors which 
are not confounded are more sensitive because there are 
fewer uncontrolled sources of variation. 

If one uses the approach that assumes correlated errors,
the expected mean square of A has this form

    E(MS_a) = \sigma^2 [1 + (b - 1) r] + nb\sigma_\alpha^2

where r is the correlation between pairs of observations
on the same element and b is the number of levels of factor
B. The denominator of an F ratio testing the variance of
factor A is equal to the first part of the right hand
side of the E(MS_a). F tests for the B and AB terms have
a denominator whose expected value is \sigma^2 (1 - r). Therefore
if there is a positive correlation between pairs of
measurements the factors which are not confounded have
more sensitive tests because their experimental error
term is smaller.
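The difference between the two error terms can be seen by evaluating them for a few values of the correlation. This is an illustrative sketch only: the error variance, the number of levels of B, and the function names are arbitrary assumptions.

```python
def between_error_term(sigma2: float, r: float, b: int) -> float:
    """Expected denominator for the test on factor A: sigma^2 [1 + (b - 1) r]."""
    return sigma2 * (1.0 + (b - 1) * r)


def within_error_term(sigma2: float, r: float) -> float:
    """Expected denominator for the tests on B and AB: sigma^2 (1 - r)."""
    return sigma2 * (1.0 - r)


# Hypothetical error variance of 10 with b = 4 levels of the repeated factor.
SIGMA2, B = 10.0, 4
table = {
    r: (between_error_term(SIGMA2, r, B), within_error_term(SIGMA2, r))
    for r in (0.0, 0.25, 0.5)
}
```

With no correlation the two denominators coincide. As r grows the denominator for factor A increases while the one for B and AB shrinks, which is why the unconfounded tests are the more sensitive ones.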



In terms of the alternative approach which postulates a
nested random factor the E(MS_a) for factor A is:

    E(MS_a) = \sigma^2 + b\sigma_\pi^2 + nb\sigma_\alpha^2

and the denominator of factor A's F ratio is equal to the
first two terms on the right side of the equation. The
denominator for tests on B and AB, the non confounded
factors, has the form \sigma^2 + \sigma_{\beta\pi}^2 where
\sigma_{\beta\pi}^2 is the subject X treatment interaction.
The magnitude of \sigma_{\beta\pi}^2 is usually
considerably smaller than that of \sigma_\pi^2. The above ideas
are generalizable to more than two factor designs and
specific attention is given to the alternative, nested
factor, approach in Chapter III.



Reference to Table 5 indicates that each item is repeated
three times for each teacher or that there are three item
scores per teacher. This is equivalent to the measuring
instrument being applied to each teacher three times.
Tables 5 to 10 deal with designs in which the item factor
is repeated.

Upon using the same notation as Medley and Mitzel (1963)
where teachers are classes (c), observers are recorders
(r), and items are items (i) and considering the last
term as the "residual," the design became as shown in
Table 6.



In Table 6 the D_c, D_r, and D_i are equal to zero or one
depending on whether the c, r, and i factors are fixed
or random, respectively. This point will be considered
again in Chapter III. At the same time the residual term
would more precisely have been expressed as Items X
Recorders within classes with degrees of freedom as given
and an E(MS) of D_r\sigma_{ri}^2 + \sigma_e^2. However, just as the
factorial model pooled the fourth order interaction,
C X R X I X S, with the residual variance, here too, the
highest order interaction was pooled with the error term.
Therefore, the error term \sigma_e^2 was replaced throughout by
\sigma^2 which equalled D_r\sigma_{ri}^2 + \sigma_e^2. For the purist, the full
model can be reclaimed by substituting D_r\sigma_{ri}^2 + \sigma_e^2 for
\sigma^2 throughout the fourth column of Table 6. However, for
the repeated measures designs there is no calculation for

Table 6

Two Way ANOVA for Repeated Measures

Source of Variation          Degrees of Freedom   Obtained Mean Square   Expected Mean Square

Between Recorders            rc - 1
  1. Class                   c - 1                s_c^2                  ri\sigma_c^2 + iD_r\sigma_r^2 + rD_i\sigma_{ci}^2 + \sigma^2
  2. Recorders w. classes    c(r - 1)             s_r^2                  iD_r\sigma_r^2 + \sigma^2

Within Recorders             rc(i - 1)
  3. Items                   i - 1                s_i^2                  rc\sigma_i^2 + rD_c\sigma_{ci}^2 + \sigma^2
  4. C X I                   (c - 1)(i - 1)       s_{ci}^2               r\sigma_{ci}^2 + \sigma^2
  5. Residual                c(r - 1)(i - 1)      s^2                    \sigma^2



an "error" term and therefore the I X R w. classes term
serves as the denominator for the Within Recorders F ratios.
Therefore, for all intents and purposes the use of \sigma^2 as the
residual E(MS) is equivalent to the more cumbersome
D_r\sigma_{ri}^2 + \sigma_e^2.

To complete the analysis, the linear equations in which
the actual MS's serve as estimates for the expected MS's
must be solved. Solution of these equations under the
assumption of random factors yielded the results indicated
in Table 7. The expression on the right of Table 7 makes
the variance components estimates specific to the example
cited earlier in which c = 2 (2 teachers), r = 6
(6 observers), i = 4 (4 items).

Based on the above and coupled with their (Medley &
Mitzel, 1963) "rules of thumb," the definitions of
\sigma_T^2 and \sigma_X^2 yielded, respectively

    \sigma_T^2 = (6 \cdot 4 \cdot 1)^2 \sigma_c^2 = 576\sigma_c^2



Table 7

Estimation of Variance Components for a
Two Factor Repeated Measures Design

1. \sigma_c^2    (=) (1/ri)(s_c^2 - s_r^2 - s_{ci}^2 + s^2)   (=) (1/24)(s_c^2 - s_r^2 - s_{ci}^2 + s^2)

2. \sigma_r^2    (=) (1/i)(s_r^2 - s^2)                        (=) (1/4)(s_r^2 - s^2)

3. \sigma_i^2    (=) (1/rc)(s_i^2 - s_{ci}^2)                  (=) (1/12)(s_i^2 - s_{ci}^2)

4. \sigma_{ci}^2 (=) (1/r)(s_{ci}^2 - s^2)                     (=) (1/6)(s_{ci}^2 - s^2)

5. \sigma^2      (=) s^2                                       (=) s^2

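and

    \sigma_X^2 = (6 \cdot 4 \cdot 1)(6 \cdot 4 \cdot 1 \sigma_c^2 + 4 \cdot 1 \sigma_r^2 + 6 \cdot 1 \sigma_{ci}^2 + \sigma^2)

               = 24(24\sigma_c^2 + 4\sigma_r^2 + 6\sigma_{ci}^2 + \sigma^2).

Therefore,

    r = \sigma_T^2 / \sigma_X^2 = 24\sigma_c^2 / (24\sigma_c^2 + 4\sigma_r^2 + 6\sigma_{ci}^2 + \sigma^2).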
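The estimation steps for the two factor repeated measures design can be sketched in code. The sketch assumes random factors (all D terms equal to one), uses the expected mean squares of Table 6 and the solutions of Table 7, and checks itself by a round trip from invented variance components to mean squares and back. All function names, dictionary keys, and numerical values are hypothetical.

```python
def expected_ms(v, c, r, i):
    """Expected mean squares of Table 6 with every D term set to one."""
    return {
        "c":   r * i * v["c"] + i * v["r"] + r * v["ci"] + v["e"],
        "r":   i * v["r"] + v["e"],
        "i":   r * c * v["i"] + r * v["ci"] + v["e"],
        "ci":  r * v["ci"] + v["e"],
        "res": v["e"],
    }


def estimate_components(ms, c, r, i):
    """Variance component estimates following the pattern of Table 7."""
    return {
        "c":  (ms["c"] - ms["r"] - ms["ci"] + ms["res"]) / (r * i),
        "r":  (ms["r"] - ms["res"]) / i,
        "i":  (ms["i"] - ms["ci"]) / (r * c),
        "ci": (ms["ci"] - ms["res"]) / r,
        "e":  ms["res"],
    }


def reliability(v, r, i):
    """r = ri*var_c / (ri*var_c + i*var_r + r*var_ci + var_e)."""
    numerator = r * i * v["c"]
    return numerator / (numerator + i * v["r"] + r * v["ci"] + v["e"])


# Invented components, with c = 2, r = 6, i = 4 as in the worked example.
C, R, I = 2, 6, 4
true_components = {"c": 2.0, "r": 1.0, "i": 3.0, "ci": 0.5, "e": 1.0}
mean_squares = expected_ms(true_components, C, R, I)
recovered = estimate_components(mean_squares, C, R, I)
coefficient = reliability(recovered, R, I)
```

The round trip returns the components that were put in, and the coefficient for these invented values is 48/56. In practice the obtained mean squares from the analysis would take the place of `mean_squares`.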

In the previous design there was only one repeated factor.
This was a two factor design because the teachers were
visited only once by a three man team of different people
for each teacher who rated each item with only one score.
Under the same conditions with each item receiving n
scores the design may be considered a three factor
experiment with repeated measures within the item factor.

The measurable variance within each item may change from
observer to observer and therefore an interaction component
must be added. A schematic representation of this
situation is given in Table 8. This may be considered
a 2 X 4 X 6 factorial design with repeated measures on the
last factor, n = 2. Subscore eight may be represented
symbolically as P_{2(41)}, that is, the second subscore in G_{41}.

Table 8

Representation of Three Factor Repeated Measures
Design with n Subscores

                                             Observers
Teachers   Items    Subscores    O1     O2     O3     O4     O5     O6

T1         Item 1   1            G11    G11    G11    G11    G11    G11
                    2            G12    G12    G12    G12    G12    G12
           Item 2   1            G21    G21    G21    G21    . . .
                    2            G22    G22    G22    . . .
           Item 3   1            G31    G31    . . .
                    2            G32    . . .
           Item 4   1            . . .
                    2            . . .

T2         Item 1   1
                    2
           Item 2   1
                    2
           Item 3   1
                    2
           Item 4   1
                    2            G42    G42    G42    G42    G42    G42

The structural model in which n is the number of subscores
for this design may be given as (Winer, 1962)

    X_{ijkn} = \mu + A_i + B_j + AB_{ij} + P_{n(ij)} + C_k + AC_{ik}
               + BC_{jk} + ABC_{ijk} + CP_{kn(ij)} + error

where the right member of the equation contains the
parameters A, B, and C which are the teachers, items,
and observers, and n(ij) identifies a subscore within
group G_{ij}. The ANOVA for this model is given in Table 9
in terms of classes, items, and recorders. The number of
subscores, classes, items, and recorders for the general case
will be given as n, c, i, and r. For the specific example
the values 2, 2, 4, and 6, respectively, will be substituted.



The residual term for the Within should really have been
\sigma_{rp}^2 + \sigma_e^2. However, as in the case of the two factor
repeated measures design, \sigma^2 was substituted for the last
term and this last term became the "residual." In lines
5 to 8 of Table 9, the original design can be reclaimed
by making the substitution \sigma_{rp}^2 + \sigma_e^2 for \sigma^2. In lines 1
to 4 of Table 9, the term that \sigma^2 replaced was really
D_r\sigma_{rp}^2 + \sigma_e^2. However, since the estimation of the variance
components rests on the assumption that r, i, and c are
random factors, it simplified the design to incorporate
this fact in the error term throughout Table 9. If r
were not a random factor there would have been two error
terms. The error term for the Between subscores would
have been r\sigma_p^2 + \sigma_e^2, since D_r = 0 for a
fixed factor, and for the Within subscores,
\sigma_{rp}^2 + \sigma_e^2. This change would not affect any
proposed tests of significance because the denominator
for the Between F ratio and the first three lines would all
be reduced by D_r\sigma_{rp}^2. The F ratios for the
Within would all contain \sigma_{rp}^2 + \sigma_e^2 since the \sigma_{rp}^2
was not multiplied by D_r.



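Solution of the linear equations when the obtained MS's
were used to estimate the expected MS's yielded the results
given in Table 10. It was assumed that D_c, D_i, and D_r
were equal to one, or that c, i, and r were random factors.
To complete the reliability calculation,

    \sigma_T^2 = (6 \cdot 4 \cdot 1)^2 \sigma_c^2 = 576\sigma_c^2

and,

    \sigma_X^2 = (6 \cdot 4 \cdot 1)(6 \cdot 4 \cdot 1 \sigma_c^2 + 6 \cdot 1 \sigma_{ci}^2 + 6 \cdot 1 \sigma_p^2 + 4 \cdot 1 \sigma_r^2 + 4 \cdot 1 \sigma_{cr}^2
                 + \sigma_{ir}^2 + \sigma_{cir}^2 + \sigma^2)

therefore,

    r = 24\sigma_c^2 / (24\sigma_c^2 + 6\sigma_{ci}^2 + 6\sigma_p^2 + 4\sigma_r^2 + 4\sigma_{cr}^2 + \sigma_{ir}^2 + \sigma_{cir}^2 + \sigma^2)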
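The same procedure for the three factor design with repeated measures on one factor can be sketched as follows. The sketch assumes random factors throughout, follows the expected mean squares of Table 9 and the solutions of Table 10, and takes the weights in the reliability coefficient from the worked example above (ri, r, i, and 1). The names and the numerical components are invented for the round-trip check.

```python
def expected_ms(v, n, c, i, r):
    """Expected mean squares of Table 9 with every D term set to one."""
    e = v["e"]
    return {
        "c":   n*i*r*v["c"] + n*r*v["ci"] + r*v["p"] + n*i*v["cr"] + n*v["cir"] + e,
        "i":   n*c*r*v["i"] + n*r*v["ci"] + r*v["p"] + n*c*v["ir"] + n*v["cir"] + e,
        "ci":  n*r*v["ci"] + r*v["p"] + n*v["cir"] + e,
        "p":   r*v["p"] + e,
        "r":   n*c*i*v["r"] + n*i*v["cr"] + n*c*v["ir"] + n*v["cir"] + e,
        "cr":  n*i*v["cr"] + n*v["cir"] + e,
        "ir":  n*c*v["ir"] + n*v["cir"] + e,
        "cir": n*v["cir"] + e,
        "res": e,
    }


def estimate_components(s, n, c, i, r):
    """Variance component estimates following the pattern of Table 10."""
    return {
        "c":   (s["c"] - s["ci"] - s["cr"] + s["cir"]) / (n * i * r),
        "i":   (s["i"] - s["ci"] - s["ir"] + s["cir"]) / (n * c * r),
        "ci":  (s["ci"] - s["cir"] - s["p"] + s["res"]) / (n * r),
        "p":   (s["p"] - s["res"]) / r,
        "r":   (s["r"] - s["cr"] - s["ir"] + s["cir"]) / (n * c * i),
        "cr":  (s["cr"] - s["cir"]) / (n * i),
        "ir":  (s["ir"] - s["cir"]) / (n * c),
        "cir": (s["cir"] - s["res"]) / n,
        "e":   s["res"],
    }


def reliability(v, i, r):
    """Coefficient with the weights used in the worked example (ri, r, i, 1)."""
    num = r * i * v["c"]
    den = (num + r*v["ci"] + r*v["p"] + i*v["r"] + i*v["cr"]
           + v["ir"] + v["cir"] + v["e"])
    return num / den


N, C, I, R = 2, 2, 4, 6
truth = {"c": 2.0, "i": 3.0, "ci": 0.5, "p": 0.25, "r": 1.0,
         "cr": 0.75, "ir": 0.4, "cir": 0.2, "e": 1.0}
recovered = estimate_components(expected_ms(truth, N, C, I, R), N, C, I, R)
coefficient = reliability(recovered, I, R)
```

Because the estimates are differences of mean squares, the round trip recovers each invented component, which is a quick way to confirm that the Table 10 expressions are consistent with the Table 9 expectations.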

The designs discussed in Tables 6 to 10 dealt with the 
case of one repeated factor, the item factor. This was 




Table 9

Three Factor ANOVA with Repeated Measures on One Factor

Source of Variation            Degrees of Freedom   Obtained Mean Square   Expected Mean Square

Between Subscores              nci - 1
  1. Class (C)                 c - 1                s_c^2                  nir\sigma_c^2 + nrD_i\sigma_{ci}^2 + r\sigma_p^2
                                                                           + niD_r\sigma_{cr}^2 + nD_iD_r\sigma_{cir}^2 + \sigma^2
  2. Items (I)                 i - 1                s_i^2                  ncr\sigma_i^2 + nrD_c\sigma_{ci}^2 + r\sigma_p^2
                                                                           + ncD_r\sigma_{ir}^2 + nD_cD_r\sigma_{cir}^2 + \sigma^2
  3. C X I                     (c-1)(i-1)           s_{ci}^2               nr\sigma_{ci}^2 + r\sigma_p^2 + nD_r\sigma_{cir}^2 + \sigma^2
  4. Subscores (P) w. groups   ci(n-1)              s_p^2                  r\sigma_p^2 + \sigma^2

Within Subscores               nci(r-1)
  5. Recorders (R)             r - 1                s_r^2                  nci\sigma_r^2 + niD_c\sigma_{cr}^2 + ncD_i\sigma_{ir}^2
                                                                           + nD_cD_i\sigma_{cir}^2 + \sigma^2
  6. C X R                     (c-1)(r-1)           s_{cr}^2               ni\sigma_{cr}^2 + nD_i\sigma_{cir}^2 + \sigma^2
  7. I X R                     (i-1)(r-1)           s_{ir}^2               nc\sigma_{ir}^2 + nD_c\sigma_{cir}^2 + \sigma^2
  8. C X I X R                 (c-1)(i-1)(r-1)      s_{cir}^2              n\sigma_{cir}^2 + \sigma^2
  9. Residual                  ci(n-1)(r-1)         s^2                    \sigma^2





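Table 10

Estimation of Variance Components for a Three Factor
Design with Repeated Measures on One Factor

1. \sigma_c^2     (=) (1/nir)(s_c^2 - s_{ci}^2 - s_{cr}^2 + s_{cir}^2)   (=) (1/48)(s_c^2 - s_{ci}^2 - s_{cr}^2 + s_{cir}^2)

2. \sigma_i^2     (=) (1/ncr)(s_i^2 - s_{ci}^2 - s_{ir}^2 + s_{cir}^2)   (=) (1/24)(s_i^2 - s_{ci}^2 - s_{ir}^2 + s_{cir}^2)

3. \sigma_{ci}^2  (=) (1/nr)(s_{ci}^2 - s_{cir}^2 - s_p^2 + s^2)         (=) (1/12)(s_{ci}^2 - s_{cir}^2 - s_p^2 + s^2)

4. \sigma_p^2     (=) (1/r)(s_p^2 - s^2)                                 (=) (1/6)(s_p^2 - s^2)

5. \sigma_r^2     (=) (1/nci)(s_r^2 - s_{cr}^2 - s_{ir}^2 + s_{cir}^2)   (=) (1/16)(s_r^2 - s_{cr}^2 - s_{ir}^2 + s_{cir}^2)

6. \sigma_{cr}^2  (=) (1/ni)(s_{cr}^2 - s_{cir}^2)                       (=) (1/8)(s_{cr}^2 - s_{cir}^2)

7. \sigma_{ir}^2  (=) (1/nc)(s_{ir}^2 - s_{cir}^2)                       (=) (1/4)(s_{ir}^2 - s_{cir}^2)

8. \sigma_{cir}^2 (=) (1/n)(s_{cir}^2 - s^2)                             (=) (1/2)(s_{cir}^2 - s^2)

9. \sigma^2       (=) s^2                                                (=) s^2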



due to the teachers being visited only once by different
teams of observers. If the experience of the researcher
with the items of the schedule has indicated that there
are no significant sources of variance as a result of
treating the items as a repeated factor and breaking up
the Within, then the design may be treated as a factorial
design with the item factor treated as a regular factor.

If this is the case, the three factor repeated measures
design in Tables 9 and 10 can be applied to situations in
which one of the other factors is repeated. Thus, the
design may be used when the same teachers are visited more
than once by different observers or when the same observers
visit different teachers. In the former situation the
teacher factor would be treated as the repeated measure,
while in the latter case the observer factor would be the
repeated factor. This would merely require replacing the
item factor by the teacher or observer factor, respectively.


The next design will consider a three factor design in which
there are two repeated measures and will deal with the case
of different teachers being visited by the same observers
and rated on the same items. In this situation the
observer and item factors are the repeated measures across
teachers. The structural model for this design may be
indicated as

    X_{ijkm} = \mu + A_i + P_{m(i)} + B_j + AB_{ij} + BP_{jm(i)} + C_k
               + AC_{ik} + CP_{km(i)} + BC_{jk} + ABC_{ijk}
               + BCP_{jkm(i)} + error

Schematically, this situation may be seen in Table 11 where
A, B, C, are the teachers, items, and observers, and P_{m(i)}
is the sum of the jk observations on subscore m for teacher i.

The ANOVA model is given in Table 12 (Winer, 1962) where the
assumption that c, i, and r are random factors has been
incorporated into the expected MS's. This assumption
permitted the use of the same error or "residual" term
throughout. For, it will be noticed in Table 13, that if
D_r = 1 and D_i = 1, i.e., that r and i are random factors,
then \sigma_{irp}^2 + \sigma_e^2 appears in each line and is equivalent to
the error term of Table 12.



Table 11

Schematic Representation of Three Factor Design
with Two Repeated Measures and n Subscores

Teacher   Subscore   Item 1                  . . .   Item j                  Total
                     obs. 1 . . . obs. k     . . .   obs. 1 . . . obs. k

          1(i)                                                               P_{1(i)}
            .                                                                   .
T_i       m(i)                                                               P_{m(i)}
            .                                                                   .
          n(i)                                                               P_{n(i)}



Table 12

Three Factor ANOVA with Repeated Measures on Two Factors

Source of Variation            Degrees of Freedom   Obtained Mean Square   Expected Mean Square

Between Subjects               nc - 1
  1. Class (C)                 c - 1                s_c^2                  nir\sigma_c^2 + ir\sigma_p^2 + nr\sigma_{ci}^2 + r\sigma_{ip}^2
                                                                           + ni\sigma_{cr}^2 + i\sigma_{rp}^2 + n\sigma_{cir}^2 + \sigma^2
  2. Subscores (P) w. groups   c(n-1)               s_p^2                  ir\sigma_p^2 + r\sigma_{ip}^2 + i\sigma_{rp}^2 + \sigma^2

Within Subjects                nc(ir-1)
  3. Items (I)                 i - 1                s_i^2                  ncr\sigma_i^2 + nr\sigma_{ci}^2 + r\sigma_{ip}^2 + nc\sigma_{ir}^2
                                                                           + n\sigma_{cir}^2 + \sigma^2
  4. C X I                     (c-1)(i-1)           s_{ci}^2               nr\sigma_{ci}^2 + r\sigma_{ip}^2 + n\sigma_{cir}^2 + \sigma^2
  5. I X P                     c(n-1)(i-1)          s_{ip}^2               r\sigma_{ip}^2 + \sigma^2
  6. Recorders (R)             r - 1                s_r^2                  nci\sigma_r^2 + ni\sigma_{cr}^2 + i\sigma_{rp}^2 + nc\sigma_{ir}^2
                                                                           + n\sigma_{cir}^2 + \sigma^2
  7. C X R                     (c-1)(r-1)           s_{cr}^2               ni\sigma_{cr}^2 + i\sigma_{rp}^2 + n\sigma_{cir}^2 + \sigma^2
  8. R X P                     c(n-1)(r-1)          s_{rp}^2               i\sigma_{rp}^2 + \sigma^2
  9. I X R                     (i-1)(r-1)           s_{ir}^2               nc\sigma_{ir}^2 + n\sigma_{cir}^2 + \sigma^2
 10. C X I X R                 (c-1)(i-1)(r-1)      s_{cir}^2              n\sigma_{cir}^2 + \sigma^2
 11. Residual                  c(n-1)(i-1)(r-1)     s^2                    \sigma^2



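Table 13

Expected Mean Square of Three Factor Design
with Repeated Measures on Two Factors

Source                          E(MS)

 1. Class (C)                   nir\sigma_c^2 + ir\sigma_p^2 + nrD_i\sigma_{ci}^2 + rD_i\sigma_{ip}^2 + niD_r\sigma_{cr}^2
                                + iD_r\sigma_{rp}^2 + nD_iD_r\sigma_{cir}^2 + D_iD_r\sigma_{irp}^2 + \sigma_e^2
 2. Subscores (P) w. groups     ir\sigma_p^2 + rD_i\sigma_{ip}^2 + iD_r\sigma_{rp}^2 + D_iD_r\sigma_{irp}^2 + \sigma_e^2
 3. Items (I)                   ncr\sigma_i^2 + nrD_c\sigma_{ci}^2 + r\sigma_{ip}^2 + ncD_r\sigma_{ir}^2 + nD_cD_r\sigma_{cir}^2
                                + D_r\sigma_{irp}^2 + \sigma_e^2
 4. C X I                       nr\sigma_{ci}^2 + r\sigma_{ip}^2 + nD_r\sigma_{cir}^2 + D_r\sigma_{irp}^2 + \sigma_e^2
 5. I X P                       r\sigma_{ip}^2 + D_r\sigma_{irp}^2 + \sigma_e^2
 6. Recorders (R)               nci\sigma_r^2 + niD_c\sigma_{cr}^2 + i\sigma_{rp}^2 + ncD_i\sigma_{ir}^2 + nD_cD_i\sigma_{cir}^2
                                + D_i\sigma_{irp}^2 + \sigma_e^2
 7. C X R                       ni\sigma_{cr}^2 + i\sigma_{rp}^2 + nD_i\sigma_{cir}^2 + D_i\sigma_{irp}^2 + \sigma_e^2
 8. R X P                       i\sigma_{rp}^2 + D_i\sigma_{irp}^2 + \sigma_e^2
 9. I X R                       nc\sigma_{ir}^2 + nD_c\sigma_{cir}^2 + \sigma_{irp}^2 + \sigma_e^2
10. C X I X R                   n\sigma_{cir}^2 + \sigma_{irp}^2 + \sigma_e^2
11. I X R X P                   \sigma_{irp}^2 + \sigma_e^2
12. Error                       \sigma_e^2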



Setting the actual MS's of Table 12 equal to their
respective expected MS's yields 12 linear equations whose
solutions are given in Table 14. For illustrative purposes
it was assumed that n = 2, c = 2, i = 4, and r = 6 which
were the same as the values in Table 10. As a result,

    \sigma_T^2 = (6 \cdot 4)^2 \sigma_c^2 = 576\sigma_c^2








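Table 14

Estimation of Variance Components for a Three Factor
Design with Repeated Measures on Two Factors

(General estimate, followed by the estimate for n = 2, c = 2, i = 4, r = 6)

  1. $\sigma_c^2$      (=)  $\frac{1}{nir}(s_c^2 - s_p^2 - s_{ci}^2 + s_{ip}^2 - s_{cr}^2 + s_{rp}^2 + s_{cir}^2 - s^2)$   (=)  $\frac{1}{48}(s_c^2 - s_p^2 - s_{ci}^2 + s_{ip}^2 - s_{cr}^2 + s_{rp}^2 + s_{cir}^2 - s^2)$
  2. $\sigma_p^2$      (=)  $\frac{1}{ir}(s_p^2 - s_{ip}^2 - s_{rp}^2 + s^2)$            (=)  $\frac{1}{24}(s_p^2 - s_{ip}^2 - s_{rp}^2 + s^2)$
  3. $\sigma_i^2$      (=)  $\frac{1}{ncr}(s_i^2 - s_{ci}^2 - s_{ir}^2 + s_{cir}^2)$     (=)  $\frac{1}{24}(s_i^2 - s_{ci}^2 - s_{ir}^2 + s_{cir}^2)$
  4. $\sigma_{ci}^2$   (=)  $\frac{1}{nr}(s_{ci}^2 - s_{ip}^2 - s_{cir}^2 + s^2)$        (=)  $\frac{1}{12}(s_{ci}^2 - s_{ip}^2 - s_{cir}^2 + s^2)$
  5. $\sigma_{ip}^2$   (=)  $\frac{1}{r}(s_{ip}^2 - s^2)$                               (=)  $\frac{1}{6}(s_{ip}^2 - s^2)$
  6. $\sigma_r^2$      (=)  $\frac{1}{nci}(s_r^2 - s_{cr}^2 - s_{ir}^2 + s_{cir}^2)$     (=)  $\frac{1}{16}(s_r^2 - s_{cr}^2 - s_{ir}^2 + s_{cir}^2)$
  7. $\sigma_{cr}^2$   (=)  $\frac{1}{ni}(s_{cr}^2 - s_{rp}^2 - s_{cir}^2 + s^2)$        (=)  $\frac{1}{8}(s_{cr}^2 - s_{rp}^2 - s_{cir}^2 + s^2)$
  8. $\sigma_{rp}^2$   (=)  $\frac{1}{i}(s_{rp}^2 - s^2)$                               (=)  $\frac{1}{4}(s_{rp}^2 - s^2)$
  9. $\sigma_{ir}^2$   (=)  $\frac{1}{nc}(s_{ir}^2 - s_{cir}^2)$                        (=)  $\frac{1}{4}(s_{ir}^2 - s_{cir}^2)$
 10. $\sigma_{cir}^2$  (=)  $\frac{1}{n}(s_{cir}^2 - s^2)$                              (=)  $\frac{1}{2}(s_{cir}^2 - s^2)$
 11. $\sigma^2$        (=)  $s^2$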



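$\sigma_X^2 = (6 \cdot 4)(6 \cdot 4\,\sigma_c^2 + 6 \cdot 4\,\sigma_p^2 + 6\sigma_{ci}^2 + 6\sigma_{ip}^2 + 4\sigma_{cr}^2 + 4\sigma_{rp}^2 + \sigma_{cir}^2 + \sigma^2)$

$= 24(24\sigma_c^2 + 24\sigma_p^2 + 6\sigma_{ci}^2 + 6\sigma_{ip}^2 + 4\sigma_{cr}^2 + 4\sigma_{rp}^2 + \sigma_{cir}^2 + \sigma^2)$

Therefore, $r = 24\sigma_c^2 / (24\sigma_c^2 + 24\sigma_p^2 + 6\sigma_{ci}^2 + 6\sigma_{ip}^2 + 4\sigma_{cr}^2 + 4\sigma_{rp}^2 + \sigma_{cir}^2 + \sigma^2)$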
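
The reliability ratio for this design can be checked numerically. The following is a minimal Python sketch, assuming the coefficients read from the total-variance expression above (i = 4 items, r = 6 recorders). The function name, the dictionary keys, and the unit-valued components are hypothetical and serve only as illustration; they are not taken from the report.

```python
def three_factor_reliability(var, i=4, r=6):
    """Return sigma_T^2 / sigma_X^2 for i items and r recorders.

    `var` maps component labels (c, p, ci, ip, cr, rp, cir, e) to
    variance estimates. The labels are illustrative assumptions.
    """
    true_var = (i * r) ** 2 * var["c"]
    total_var = (i * r) * (
        i * r * var["c"]
        + i * r * var["p"]
        + r * var["ci"]
        + r * var["ip"]
        + i * var["cr"]
        + i * var["rp"]
        + var["cir"]
        + var["e"]
    )
    return true_var / total_var


# Hypothetical components, all set to 1.0 purely for illustration.
unit_components = {k: 1.0 for k in ("c", "p", "ci", "ip", "cr", "rp", "cir", "e")}
r_unit = three_factor_reliability(unit_components)
```

With every component equal, the ratio reduces to 24 divided by the sum of the coefficients inside the parentheses, which makes it easy to see how heavily the class and subscore components weigh on the result.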

It should be noted that in all the designs discussed up to
now the situation factor has been equal to one, and therefore
the $(qjt)^2$ value which was to be multiplied by the $\sigma_c^2$ term to
yield the numerator of the reliability estimate has
essentially been equal to $(qj)^2$. The fact that t = 1 was also
taken into account where necessary in the denominator.

It was pointed out in the discussion of the three factor design with
one repeated measure that the design was also applicable to
cases in which the observer or teacher factor was repeated.

The same reasoning obtains for the design with two repeated
measures. That is, it is possible that teachers and items or
teachers and observers, rather than items and observers, be
considered the repeated factors. This is only possible if
each teacher is observed more than once. If such is the case,
the value of t in the calculation of r would obviously not be
equal to one.

Inherent in the preceding idea of treating the teacher factor
as a repeated measure is the assumption that the visit or
situation factor need not be treated as a separate variable
but may be subsumed under the teacher factor. However, it is
possible to treat the teacher factor as a non-repeated measure
by introducing a situation factor for the case in which the
teacher was visited more than once. In such a situation,
assuming the same observers and items, one may treat the
experiment as a four-way design with two repeated measures.

The observer and item factors, the repeated measures, may be
considered subsumed under the teacher and situation variables,
respectively. The linear model may then be given as

$X_{ijklm} = \mu + A_i + B_j + AB_{ij} + P_{m(ij)} + C_k + AC_{ik} + BC_{jk}$

$\quad + ABC_{ijk} + CP_{km(ij)} + D_l + AD_{il} + BD_{jl} + ABD_{ijl}$

$\quad + DP_{lm(ij)} + CD_{kl} + ACD_{ikl} + BCD_{jkl} + ABCD_{ijkl}$

$\quad + CDP_{klm(ij)} + \text{error}$

where A, B, C, D are the teacher, recorder, situation and
item factors, respectively.

The expected MS's for this model, assuming n subscores, are
given in Table 15. Assuming that classes, recorders, situations,
and items are random factors, $D_c$, $D_r$, $D_s$, and $D_i$ all become
equal to one. Therefore, lines 19 and 20 of Table 15 may be
pooled to form a new "residual" which may serve as the error
term throughout the design. The results of this change are
given in Table 16. Table 17 gives the degrees of freedom for
the design. The solutions of the linear equations resulting
from setting the observed MS's equal to their respective
expected MS's are given in Table 18. Based on the estimated
variance components,

$\sigma_T^2 = (6 \cdot 4 \cdot 3)^2 \sigma_c^2$






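Table 15

Four Factor Design with Two Repeated Measures

Source                E(MS)

  1. Class (C)        $nrsi\sigma_c^2 + nsiD_r\sigma_{cr}^2 + si\sigma_p^2 + nriD_s\sigma_{cs}^2 + niD_rD_s\sigma_{crs}^2 + iD_s\sigma_{sp}^2 + nrsD_i\sigma_{ci}^2 + nsD_rD_i\sigma_{cri}^2 + sD_i\sigma_{ip}^2 + nrD_sD_i\sigma_{csi}^2 + nD_rD_sD_i\sigma_{crsi}^2 + D_sD_i\sigma_{sip}^2 + \sigma_e^2$
  2. Recorder (R)     $ncsi\sigma_r^2 + nsiD_c\sigma_{cr}^2 + si\sigma_p^2 + nciD_s\sigma_{rs}^2 + niD_cD_s\sigma_{crs}^2 + iD_s\sigma_{sp}^2 + ncsD_i\sigma_{ri}^2 + nsD_cD_i\sigma_{cri}^2 + sD_i\sigma_{ip}^2 + ncD_sD_i\sigma_{rsi}^2 + nD_cD_sD_i\sigma_{crsi}^2 + D_sD_i\sigma_{sip}^2 + \sigma_e^2$
  3. C X R            $nsi\sigma_{cr}^2 + si\sigma_p^2 + niD_s\sigma_{crs}^2 + iD_s\sigma_{sp}^2 + nsD_i\sigma_{cri}^2 + sD_i\sigma_{ip}^2 + nD_sD_i\sigma_{crsi}^2 + D_sD_i\sigma_{sip}^2 + \sigma_e^2$
  4. S.W.G. (P)       $si\sigma_p^2 + iD_s\sigma_{sp}^2 + sD_i\sigma_{ip}^2 + D_sD_i\sigma_{sip}^2 + \sigma_e^2$
  5. Situation (S)    $ncri\sigma_s^2 + nriD_c\sigma_{cs}^2 + nciD_r\sigma_{rs}^2 + niD_cD_r\sigma_{crs}^2 + i\sigma_{sp}^2 + ncrD_i\sigma_{si}^2 + nrD_cD_i\sigma_{csi}^2 + ncD_rD_i\sigma_{rsi}^2 + nD_cD_rD_i\sigma_{crsi}^2 + D_i\sigma_{sip}^2 + \sigma_e^2$
  6. C X S            $nri\sigma_{cs}^2 + niD_r\sigma_{crs}^2 + i\sigma_{sp}^2 + nrD_i\sigma_{csi}^2 + nD_rD_i\sigma_{crsi}^2 + D_i\sigma_{sip}^2 + \sigma_e^2$
  7. R X S            $nci\sigma_{rs}^2 + niD_c\sigma_{crs}^2 + i\sigma_{sp}^2 + ncD_i\sigma_{rsi}^2 + nD_cD_i\sigma_{crsi}^2 + D_i\sigma_{sip}^2 + \sigma_e^2$
  8. C X R X S        $ni\sigma_{crs}^2 + i\sigma_{sp}^2 + nD_i\sigma_{crsi}^2 + D_i\sigma_{sip}^2 + \sigma_e^2$
  9. S X P            $i\sigma_{sp}^2 + D_i\sigma_{sip}^2 + \sigma_e^2$
 10. Item (I)         $ncrs\sigma_i^2 + nrsD_c\sigma_{ci}^2 + ncsD_r\sigma_{ri}^2 + nsD_cD_r\sigma_{cri}^2 + s\sigma_{ip}^2 + ncrD_s\sigma_{si}^2 + nrD_cD_s\sigma_{csi}^2 + ncD_rD_s\sigma_{rsi}^2 + nD_cD_rD_s\sigma_{crsi}^2 + D_s\sigma_{sip}^2 + \sigma_e^2$
 11. C X I            $nrs\sigma_{ci}^2 + nsD_r\sigma_{cri}^2 + s\sigma_{ip}^2 + nrD_s\sigma_{csi}^2 + nD_rD_s\sigma_{crsi}^2 + D_s\sigma_{sip}^2 + \sigma_e^2$
 12. R X I            $ncs\sigma_{ri}^2 + nsD_c\sigma_{cri}^2 + s\sigma_{ip}^2 + ncD_s\sigma_{rsi}^2 + nD_cD_s\sigma_{crsi}^2 + D_s\sigma_{sip}^2 + \sigma_e^2$
 13. C X R X I        $ns\sigma_{cri}^2 + s\sigma_{ip}^2 + nD_s\sigma_{crsi}^2 + D_s\sigma_{sip}^2 + \sigma_e^2$
 14. I X P            $s\sigma_{ip}^2 + D_s\sigma_{sip}^2 + \sigma_e^2$
 15. S X I            $ncr\sigma_{si}^2 + nrD_c\sigma_{csi}^2 + ncD_r\sigma_{rsi}^2 + nD_cD_r\sigma_{crsi}^2 + \sigma_{sip}^2 + \sigma_e^2$
 16. C X S X I        $nr\sigma_{csi}^2 + nD_r\sigma_{crsi}^2 + \sigma_{sip}^2 + \sigma_e^2$
 17. R X S X I        $nc\sigma_{rsi}^2 + nD_c\sigma_{crsi}^2 + \sigma_{sip}^2 + \sigma_e^2$
 18. C X R X S X I    $n\sigma_{crsi}^2 + \sigma_{sip}^2 + \sigma_e^2$
 19. S X I X P        $\sigma_{sip}^2 + \sigma_e^2$
 20. Error            $\sigma_e^2$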



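Table 16

Four Random Factors Design with Two Repeated Measures

Source            E(MS)

  1. C            $nrsi\sigma_c^2 + nsi\sigma_{cr}^2 + si\sigma_p^2 + nri\sigma_{cs}^2 + ni\sigma_{crs}^2 + i\sigma_{sp}^2 + nrs\sigma_{ci}^2 + ns\sigma_{cri}^2 + s\sigma_{ip}^2 + nr\sigma_{csi}^2 + n\sigma_{crsi}^2 + \sigma^2$
  2. R            $ncsi\sigma_r^2 + nsi\sigma_{cr}^2 + si\sigma_p^2 + nci\sigma_{rs}^2 + ni\sigma_{crs}^2 + i\sigma_{sp}^2 + ncs\sigma_{ri}^2 + ns\sigma_{cri}^2 + s\sigma_{ip}^2 + nc\sigma_{rsi}^2 + n\sigma_{crsi}^2 + \sigma^2$
  3. CXR          $nsi\sigma_{cr}^2 + si\sigma_p^2 + ni\sigma_{crs}^2 + i\sigma_{sp}^2 + ns\sigma_{cri}^2 + s\sigma_{ip}^2 + n\sigma_{crsi}^2 + \sigma^2$
  4. P            $si\sigma_p^2 + i\sigma_{sp}^2 + s\sigma_{ip}^2 + \sigma^2$
  5. S            $ncri\sigma_s^2 + nri\sigma_{cs}^2 + nci\sigma_{rs}^2 + ni\sigma_{crs}^2 + i\sigma_{sp}^2 + ncr\sigma_{si}^2 + nr\sigma_{csi}^2 + nc\sigma_{rsi}^2 + n\sigma_{crsi}^2 + \sigma^2$
  6. CXS          $nri\sigma_{cs}^2 + ni\sigma_{crs}^2 + i\sigma_{sp}^2 + nr\sigma_{csi}^2 + n\sigma_{crsi}^2 + \sigma^2$
  7. RXS          $nci\sigma_{rs}^2 + ni\sigma_{crs}^2 + i\sigma_{sp}^2 + nc\sigma_{rsi}^2 + n\sigma_{crsi}^2 + \sigma^2$
  8. CXRXS        $ni\sigma_{crs}^2 + i\sigma_{sp}^2 + n\sigma_{crsi}^2 + \sigma^2$
  9. SXP          $i\sigma_{sp}^2 + \sigma^2$
 10. I            $ncrs\sigma_i^2 + nrs\sigma_{ci}^2 + ncs\sigma_{ri}^2 + ns\sigma_{cri}^2 + s\sigma_{ip}^2 + ncr\sigma_{si}^2 + nr\sigma_{csi}^2 + nc\sigma_{rsi}^2 + n\sigma_{crsi}^2 + \sigma^2$
 11. CXI          $nrs\sigma_{ci}^2 + ns\sigma_{cri}^2 + s\sigma_{ip}^2 + nr\sigma_{csi}^2 + n\sigma_{crsi}^2 + \sigma^2$
 12. RXI          $ncs\sigma_{ri}^2 + ns\sigma_{cri}^2 + s\sigma_{ip}^2 + nc\sigma_{rsi}^2 + n\sigma_{crsi}^2 + \sigma^2$
 13. CXRXI        $ns\sigma_{cri}^2 + s\sigma_{ip}^2 + n\sigma_{crsi}^2 + \sigma^2$
 14. IXP          $s\sigma_{ip}^2 + \sigma^2$
 15. SXI          $ncr\sigma_{si}^2 + nr\sigma_{csi}^2 + nc\sigma_{rsi}^2 + n\sigma_{crsi}^2 + \sigma^2$
 16. CXSXI        $nr\sigma_{csi}^2 + n\sigma_{crsi}^2 + \sigma^2$
 17. RXSXI        $nc\sigma_{rsi}^2 + n\sigma_{crsi}^2 + \sigma^2$
 18. CXRXSXI      $n\sigma_{crsi}^2 + \sigma^2$
 19. Residual     $\sigma^2$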






Table 17

Degrees of Freedom for a Four Factor Design
with Two Repeated Measures

Source             d.f.                       Actual (MS)

Between            ncr-1
  1. C             c-1                        $s_c^2$
  2. R             r-1                        $s_r^2$
  3. CXR           (c-1)(r-1)                 $s_{cr}^2$
  4. P             cr(n-1)                    $s_p^2$

Within             ncr(si-1)
  5. S             s-1                        $s_s^2$
  6. CXS           (c-1)(s-1)                 $s_{cs}^2$
  7. RXS           (r-1)(s-1)                 $s_{rs}^2$
  8. CXRXS         (c-1)(r-1)(s-1)            $s_{crs}^2$
  9. SXP           cr(n-1)(s-1)               $s_{sp}^2$
 10. I             i-1                        $s_i^2$
 11. CXI           (c-1)(i-1)                 $s_{ci}^2$
 12. RXI           (r-1)(i-1)                 $s_{ri}^2$
 13. CXRXI         (c-1)(r-1)(i-1)            $s_{cri}^2$
 14. IXP           cr(n-1)(i-1)               $s_{ip}^2$
 15. SXI           (s-1)(i-1)                 $s_{si}^2$
 16. CXSXI         (c-1)(s-1)(i-1)            $s_{csi}^2$
 17. RXSXI         (r-1)(s-1)(i-1)            $s_{rsi}^2$
 18. CXRXSXI       (c-1)(r-1)(s-1)(i-1)       $s_{crsi}^2$
 19. Residual      cr(n-1)(s-1)(i-1)          $s^2$






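That is, $\sigma_T^2 = 5184\,\sigma_c^2$, and

$\sigma_X^2 = 6 \cdot 4 \cdot 3\,(6 \cdot 4 \cdot 3\,\sigma_c^2 + 4 \cdot 3\,\sigma_{cr}^2 + 4 \cdot 3\,\sigma_p^2 + 6 \cdot 4\,\sigma_{cs}^2 + 4\sigma_{crs}^2 + 4\sigma_{sp}^2 + 6 \cdot 3\,\sigma_{ci}^2 + 3\sigma_{cri}^2 + 3\sigma_{ip}^2 + 6\sigma_{csi}^2 + \sigma_{crsi}^2 + \sigma^2)$

$= 72(72\sigma_c^2 + 12\sigma_{cr}^2 + 12\sigma_p^2 + 24\sigma_{cs}^2 + 4\sigma_{crs}^2 + 4\sigma_{sp}^2 + 18\sigma_{ci}^2 + 3\sigma_{cri}^2 + 3\sigma_{ip}^2 + 6\sigma_{csi}^2 + \sigma_{crsi}^2 + \sigma^2)$

Therefore,

$r = 72\sigma_c^2 / D$

where

$D = 72\sigma_c^2 + 12\sigma_{cr}^2 + 12\sigma_p^2 + 24\sigma_{cs}^2 + 4\sigma_{crs}^2 + 4\sigma_{sp}^2 + 18\sigma_{ci}^2 + 3\sigma_{cri}^2 + 3\sigma_{ip}^2 + 6\sigma_{csi}^2 + \sigma_{crsi}^2 + \sigma^2$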
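
The weights in D follow a simple pattern: each component is multiplied by the number of levels of the recorder, situation, and item factors that do not appear in its subscript. A minimal Python sketch of that rule follows, assuming r = 6 recorders, s = 3 situations, and i = 4 items as in the illustration. The function name, dictionary keys, and the unit-valued components are hypothetical.

```python
def four_factor_reliability(var, r=6, s=3, i=4):
    """Return (reliability, weights) for the four-factor design.

    `var` maps component labels to variance estimates. The labels
    (c, cr, p, cs, crs, sp, ci, cri, ip, csi, crsi, e) are illustrative.
    """
    weights = {
        "c": r * s * i,
        "cr": s * i,
        "p": s * i,
        "cs": r * i,
        "crs": i,
        "sp": i,
        "ci": r * s,
        "cri": s,
        "ip": s,
        "csi": r,
        "crsi": 1,
        "e": 1,
    }
    denominator = sum(weights[k] * var[k] for k in weights)
    return weights["c"] * var["c"] / denominator, weights


# Hypothetical components, all equal to 1.0 for illustration only.
unit = {k: 1.0 for k in ("c", "cr", "p", "cs", "crs", "sp",
                         "ci", "cri", "ip", "csi", "crsi", "e")}
r_unit, weights = four_factor_reliability(unit)
```

The returned weights can be compared directly with the coefficients in D.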



In all of the foregoing models the assumption was made that
the factors were random variables. The reasons for this
assumption were that (1) this is the more general case and
may be applied to a specific instance of a fixed variable
by dropping out the requisite number of terms through
the use of the Medley and Mitzel (1963) "ground rules"
and the fact that $D_x = 0$, and (2) the assumption
of a fixed variable tends to inflate the reliability
calculation (Medley & Mitzel, 1963). Part of the data
is analyzed in Chapter III under the assumption that
one of the variables was fixed, and the resulting reliability
estimate is considerably higher than it would have been
had this assumption not been made.

Table 18

Variance Components for a Four Factor Design
with Two Repeated Measures

  1. $\sigma_c^2$       (=)  $\frac{1}{nrsi}(s_c^2 - s_{cr}^2 - s_{cs}^2 + s_{crs}^2 - s_{ci}^2 + s_{cri}^2 + s_{csi}^2 - s_{crsi}^2)$
  2. $\sigma_r^2$       (=)  $\frac{1}{ncsi}(s_r^2 - s_{cr}^2 - s_{rs}^2 + s_{crs}^2 - s_{ri}^2 + s_{cri}^2 + s_{rsi}^2 - s_{crsi}^2)$
  3. $\sigma_{cr}^2$    (=)  $\frac{1}{nsi}(s_{cr}^2 - s_p^2 - s_{crs}^2 + s_{sp}^2 - s_{cri}^2 + s_{ip}^2 + s_{crsi}^2 - s^2)$
  4. $\sigma_p^2$       (=)  $\frac{1}{si}(s_p^2 - s_{sp}^2 - s_{ip}^2 + s^2)$
  5. $\sigma_s^2$       (=)  $\frac{1}{ncri}(s_s^2 - s_{cs}^2 - s_{rs}^2 - s_{si}^2 + s_{crs}^2 + s_{csi}^2 + s_{rsi}^2 - s_{crsi}^2)$
  6. $\sigma_{cs}^2$    (=)  $\frac{1}{nri}(s_{cs}^2 - s_{crs}^2 - s_{csi}^2 + s_{crsi}^2)$
  7. $\sigma_{rs}^2$    (=)  $\frac{1}{nci}(s_{rs}^2 - s_{crs}^2 - s_{rsi}^2 + s_{crsi}^2)$
  8. $\sigma_{crs}^2$   (=)  $\frac{1}{ni}(s_{crs}^2 - s_{sp}^2 - s_{crsi}^2 + s^2)$
  9. $\sigma_{sp}^2$    (=)  $\frac{1}{i}(s_{sp}^2 - s^2)$
 10. $\sigma_i^2$       (=)  $\frac{1}{ncrs}(s_i^2 - s_{ci}^2 - s_{ri}^2 - s_{si}^2 + s_{cri}^2 + s_{csi}^2 + s_{rsi}^2 - s_{crsi}^2)$
 11. $\sigma_{ci}^2$    (=)  $\frac{1}{nrs}(s_{ci}^2 - s_{cri}^2 - s_{csi}^2 + s_{crsi}^2)$
 12. $\sigma_{ri}^2$    (=)  $\frac{1}{ncs}(s_{ri}^2 - s_{cri}^2 - s_{rsi}^2 + s_{crsi}^2)$
 13. $\sigma_{cri}^2$   (=)  $\frac{1}{ns}(s_{cri}^2 - s_{ip}^2 - s_{crsi}^2 + s^2)$
 14. $\sigma_{ip}^2$    (=)  $\frac{1}{s}(s_{ip}^2 - s^2)$
 15. $\sigma_{si}^2$    (=)  $\frac{1}{ncr}(s_{si}^2 - s_{csi}^2 - s_{rsi}^2 + s_{crsi}^2)$
 16. $\sigma_{csi}^2$   (=)  $\frac{1}{nr}(s_{csi}^2 - s_{crsi}^2)$
 17. $\sigma_{rsi}^2$   (=)  $\frac{1}{nc}(s_{rsi}^2 - s_{crsi}^2)$
 18. $\sigma_{crsi}^2$  (=)  $\frac{1}{n}(s_{crsi}^2 - s^2)$
 19. $\sigma^2$         (=)  $s^2$






CHAPTER III 

Analysis of the Results of the Investigation 

The SUTEC Observation data were subjected to two analyses.
The first analysis dealt with the reliability of three of
the items which comprised the schedule, while the second
analysis dealt with the reliability estimate for the
entire schedule.

It was the purpose of the present section to develop
applications of some of the ideas discussed in Chapter II.
Thus, this chapter addresses itself mainly to the last
four questions proposed at the outset of the investigation,
the first three questions having been discussed in Chapter
II.

Reliability of Individual Items 

As was pointed out in the procedure section of the preceding
chapter, seven observers were present for each reliability
visit to two teachers. However, only four of the same
observers were present for both observations. The items
observed were teacher mobility, involvement of children,
and irrelevant acts. Because these observations were
carried out while the schedule was being devised, it was
decided that an ANOVA and a reliability coefficient would
be calculated for each item rather than for the entire
schedule.

The analyses were carried out by taking the general
paradigm and applying it to this specific case. The model,
in which all variables actually represent deviations from
their respective means, was

$x_{ijk} = C_i + O_j + I_{ij} + e_{ijk}$

where $C_i$ represented the deviation associated with teacher
i, $O_j$ the deviation associated with observer j, $I_{ij}$ the
interaction between teachers and observers, and $e_{ijk}$ the
"error" or residual term. Upon taking mathematical
expectations, assuming infinite populations of teachers
and observers, the result was

$\sigma_x^2 = \sigma_c^2 + \sigma_o^2 + \sigma_{co}^2 + \sigma^2$

where $\sigma_x^2$ was the total variance for all the x observations,
and the terms on the right of this equation were the
respective population variances for teachers, observers,
interaction, and residual. Because the ratio
F = MS_co / MS_error, which tested the hypothesis $H_1$: $\sigma_{co}^2 = 0$,
had a value less than one for mobility, involvement and
irrelevant acts, the interaction and error terms were
pooled to form a new residual term. The resulting ANOVA
is given in Table 19 and the estimated variance components
are given in Table 20. In both Tables 19 and 20, $\sigma^2$, $\sigma_o^2$,
and $\sigma_c^2$ are the expected variances of the residual or error
term, observers, and teachers or classes, respectively.

Table 19

Analysis of Variance of the SUTEC Observation Team Data

                                                    Observed MS
Source         df    E(MS)                    Mobility   F       Involv.   F         Irr. Acts   F

Class          1     $7\sigma_c^2 + \sigma^2$   4.95       6.05*   38.79     16.13**   380.64      53.42**
Observation    6     $2\sigma_o^2 + \sigma^2$   2.08       2.55    6.86      2.85      48.32       6.78*
Residual       6     $\sigma^2$                 .82                2.41                7.13

*P < .05
**P < .01

Table 20

Estimation of Variance Components of the
SUTEC Observation Team Data

E(MS)                                                     Mobility    Involvement    Irrelevant Acts

$\sigma_c^2$ (=)(a)  $\frac{1}{7}(s_c^2 - s^2)$ (b)       .59         5.20           53.36
$\sigma_o^2$ (=)     $\frac{1}{2}(s_o^2 - s^2)$           .63         2.23           20.77
$\sigma^2$   (=)     $s^2$                                .82         2.41           7.13

(a) The symbol "(=)" is to be read "is estimated by."
(b) The $s^2$ terms denote the actual mean squares.
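
The estimates in Table 20 follow from subtracting the residual mean square and dividing by the coefficient of the component in E(MS). A minimal Python sketch, assuming the mobility mean squares as they read in Table 19 (4.95, 2.08, and .82) and the divisors 7 and 2 from the E(MS) column; the function and argument names are hypothetical.

```python
def item_components(ms_class, ms_observer, ms_residual,
                    n_observers=7, n_classes=2):
    """Solve the E(MS) equations for one item.

    Class:     n_observers * var_class + var_residual
    Observer:  n_classes * var_observer + var_residual
    Residual:  var_residual
    """
    var_class = (ms_class - ms_residual) / n_observers
    var_observer = (ms_observer - ms_residual) / n_classes
    return var_class, var_observer, ms_residual


# Mobility mean squares as read from Table 19.
mobility = item_components(4.95, 2.08, 0.82)
```

A negative result would indicate a mean square smaller than the residual; such estimates are conventionally set to zero, which this sketch does not do.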



The overall reliability coefficient was computed by using
the formula

$r = \sigma_T^2 / \sigma_X^2$

where $\sigma_T^2$ and $\sigma_X^2$, the true and total variances, respectively,
were defined by Medley and Mitzel (1963). Here,

$\sigma_T^2 = (qjr)^2 \sigma_c^2$ and $\sigma_X^2 = qjr\,(qjr\,\sigma_c^2 + qr\,\sigma_o^2 + \sigma^2)$

where q is the number of observation records, r the number
of situations, and j the number of items. Therefore, here,
$\sigma_T^2 = 49\sigma_c^2$ and

$\sigma_X^2 = (7 \cdot 1 \cdot 1)(7 \cdot 1 \cdot 1\,\sigma_c^2 + 7 \cdot 1\,\sigma_o^2 + \sigma^2)$

$= 49\sigma_c^2 + 49\sigma_o^2 + 7\sigma^2$

If the data are analyzed only for the four observers who
were present during both reliability visits, the observer
factor will then be treated as a "fixed" rather than a "random"
factor. Accordingly, the $\sigma_o^2$ component of $\sigma_X^2$ is zero, and
$\sigma_X^2 = (qjr)(qjr\,\sigma_c^2 + \sigma^2)$. Furthermore, the hypothesis $H_1$:
$\sigma_o^2 = 0$, tested by the ratio F = MS_o / MS_res, yielded values
of one or less for all three items and therefore
the observer factor was pooled with the error term. The resulting
ANOVA and variance components for these data
are given in Tables 21 and 22, respectively.

Table 21

ANOVA for the Four Observers Present
During Both Observations

                                                 Observed MS
Source      df    E(MS)                      Mobility   F       Involv.   F       Irr. Acts   F

Class       1     $4\sigma_c^2 + \sigma^2$   6.13       6.39*   21.13     6.76*   180.50      10.08*
Residual    6     $\sigma^2$                 .96                3.13              17.92

*P < .05



Table 22

Estimation of Variance Components for the Four
Observers Present During Both Visits

E(MS)                                                Mobility    Involvement    Irrelevant Acts

$\sigma_c^2$ (=)  $\frac{1}{4}(s_c^2 - s^2)$         1.19        4.50           40.65
$\sigma^2$   (=)  $s^2$                              .96         3.13           17.92



The appropriate values for $\sigma_T^2$ and $\sigma_X^2$ were calculated and
are given in Table 23. These values were used to
calculate the overall reliability coefficients for the
entire observation team. These were r (mobility) = .72,
r (involvement) = .67, and r (irrelevant acts) = .69, and the
corresponding coefficients for the four observers present
during both observations were r (mobility) = .84, r
(involvement) = .85, and r (irrelevant acts) = .90. Clearly,
then, one way to increase reliability would be to
maintain the same observers throughout, a finding in
complete agreement with common sense and the previously
cited literature.
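
The effect of fixing the observer factor can be reproduced for the involvement item. The sketch below is a hedged illustration in Python: it assumes one item and one situation (j = r = 1), the total-variance form q(q*var_class + q*var_observer + var_residual) as read from the text, and the involvement components reported in Tables 20 and 22. The function and variable names are hypothetical.

```python
def item_reliability(var_class, var_residual, q, var_observer=0.0):
    """Ratio of true to total variance for a single item.

    q is the number of observation records. Leaving var_observer at 0.0
    corresponds to treating the observer factor as fixed.
    """
    true_var = q ** 2 * var_class
    total_var = q * (q * var_class + q * var_observer + var_residual)
    return true_var / total_var


# Involvement, full team of seven observers (observers random).
r_team = item_reliability(5.20, 2.41, q=7, var_observer=2.23)

# Involvement, four observers present at both visits (observers fixed).
r_four = item_reliability(4.50, 3.13, q=4)
```

Dropping the observer term shrinks only the denominator, which is why the fixed-observer coefficient is the larger of the two.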



Table 23

Variances and Correlations for the Entire Observation Team
and the Four Observers Present During Both Observations

                  Mobility                 Involvement              Irrelevant Acts
Variance          Team     4 Observers     Team      4 Observers    Team       4 Observers

$\sigma_T^2$      26.41    20.67           254.68    72.00          2664.50    650.33
$\sigma_X^2$      36.57    24.50           380.58    84.50          3848.37    722.00
r                 .72      .84             .67       .85            .69        .90



The calculated r's indicated that 28%, 33% and 31% of
the variance was attributed to factors other than teachers
for mobility, involvement, and irrelevant acts,
respectively. By comparing the observer and residual
variances it was found that 12.2% and 15.8%, 15.9% and
17.1%, and 20.4% and 10.6% of the variances were due to
observers and residual or errors for mobility,
involvement, and irrelevant acts, respectively.

The finding that the proportion of variance due to different
teachers ranged from .67 to .72 indicated that the observation
schedule differentiated between teachers on the variables
investigated. Furthermore, in two of the three cases
less than 16% of the variance was due to observers, while
in the third case approximately 20% of the variance was
due to observers. This latter finding indicated a need
for intensifying the training procedures of the observers
on this factor, or the sharpening of the definition of
this variable, or both.

As was indicated in Chapter II, fixed factors tend to
inflate the reliability estimate, and the average increase
in r for the four observer case over the seven observer
situation was .17. Besides the obvious rationalization,
this was due to the fact that the denominator of r
decreased more quickly as a result of the removal of the
"fixed" factor. However, why irrelevant acts, the most
subjective category, yielded the highest r for the four
observer situation still remains to be investigated.

Reliability of the Entire Schedule 
Since each teacher was observed by an observer team 
peculiar to himself, the model was considered a partially 
hierarchical design. That is, each observer team had the
same number of observers but not necessarily the same 
observers and therefore the observer team factor was 
nested under the teacher factor. If teachers were factor 
A, observers factor B, and items factor C, B would be 
nested under A. Assuming that there were n scores on each 
item for each teacher per observer the sources of 
variation, degrees of freedom, and expected mean squares 
were as given in Table 24 (Winer, 1962) where p, q, and 
r were the numbers of teachers, observers, and items 
respectively. 

The $D_p$, $D_q$, and $D_r$ terms are equal to 1 - p/P, 1 - q/Q, and 1 - r/R,
respectively, where p and P, q and Q, and r and R are
the sample and population sizes of teachers, observers,
and items, respectively. Each of these D's is either 0 or
1 depending on whether the corresponding factor is fixed
or random.
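
The two limiting values of a D term can be shown in a few lines. This is a minimal Python sketch of the definition 1 - (sample size)/(population size); the function name is hypothetical, and an infinite population is represented by `math.inf`.

```python
import math


def d_term(sample_size, population_size):
    """Return D = 1 - sample_size / population_size."""
    return 1.0 - sample_size / population_size


# Random factor: the sample is drawn from an effectively infinite population.
d_random = d_term(7, math.inf)

# Fixed factor: the sample exhausts the population.
d_fixed = d_term(7, 7)
```

Intermediate values arise only for finite populations that are sampled incompletely, a case the text does not pursue.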

As was pointed out by Medley and Mitzel (1963), the
assignment of a variable as fixed tends to reduce the
error of measurement and hence inflate the reliability.
Therefore, the assumption that a variable is fixed should
be based on sound reasons. A rule of thumb for selecting
which factors are fixed and which are random is to decide
whether other elements comprising the factor might have
been used, and if so, then the factor is random (Medley
& Mitzel, 1963). For example, if no observers other than
the ones actually employed could have been used
satisfactorily, then the observer factor would be fixed.
Since there are always other teachers and observers
available, theoretically anyway, these factors are usually
considered random factors. These ideas are consonant
with the definitions given in Chapter II of this
investigation.






Table 24

Sources of Variation, Degrees of Freedom, and Expected
Mean Squares for an ANOVA Design with Factor B
Nested Under Factor A

Source of Variation    df               E(MS)

A                      p-1              $nqr\sigma_a^2 + nrD_q\sigma_b^2 + nqD_r\sigma_{ac}^2 + nD_qD_r\sigma_{bc}^2 + \sigma_e^2$
B W. A                 p(q-1)           $nr\sigma_b^2 + nD_r\sigma_{bc}^2 + \sigma_e^2$
C                      r-1              $npq\sigma_c^2 + nqD_p\sigma_{ac}^2 + nD_q\sigma_{bc}^2 + \sigma_e^2$
AC                     (p-1)(r-1)       $nq\sigma_{ac}^2 + nD_q\sigma_{bc}^2 + \sigma_e^2$
(B W. A) X C           p(q-1)(r-1)      $n\sigma_{bc}^2 + \sigma_e^2$
Within                 pqr(n-1)         $\sigma_e^2$



More precisely, as p, q, and r, the numbers of sample
elements, approach the values of P, Q, and R, the numbers
of elements in the population, the ratios p/P, q/Q, and
r/R approach a value of one and therefore $D_p$, $D_q$, and $D_r$
approach zero. If zeros are substituted for the D's,
the number of terms contained in the expected mean
squares shrinks and thus the reliability is increased
because the denominator of the fraction which defines
the reliability coefficient is decreased.

The model is also applicable even when there is only one
score per item per observer for each teacher. In this
case the model is the same as in Table 24 with n = 1 and
the within source of variation removed. If all factors
are random and ones are substituted for the D's, the
model now yields an error term of $\sigma_{bc}^2 + \sigma_e^2$ (Winer, 1962).
The remaining expected mean square values follow in a
similar fashion. To simplify the model still further the
Medley and Mitzel (1963) procedure may be utilized.
According to this procedure, the last term in the source
of variation column, the residual, is considered to be
the error term and is denoted by $\sigma^2$ rather than $\sigma_{bc}^2 + \sigma_e^2$.
The simplification of the error term and the substitution
of ones for the n and the D's result in the expected mean
squares shown in Table 25.






The only major difference between the Winer (1962) and
Medley and Mitzel (1963) approaches occurs in the F ratio
testing the main effects of Factor A. This particular
F ratio utilizes the nested factor B as its denominator,
and has a larger expected mean square term in the
simplified version than is called for by Winer (1962).
The difference between the models is due to the $\sigma_{bc}^2$ term.
This also means that a significant F ratio testing
the hypothesis $\sigma_a^2 = 0$ in the simplified version would
certainly be significant according to Winer (1962). Since
the other two F ratios, testing the hypotheses $\sigma_c^2 = 0$ and
$\sigma_{ac}^2 = 0$, use the residual expected mean square as
denominators, both the Medley and Mitzel (1963) and Winer
(1962) approaches yield the same F values in these two
cases.



Table 25

ANOVA Design with Factor B Nested Under Factor A,
All Factors Random, and n = 1

Source of Variation    df               E(MS)

A                      p-1              $qr\sigma_a^2 + r\sigma_{b(a)}^2 + q\sigma_{ac}^2 + \sigma^2$
B W. A                 p(q-1)           $r\sigma_{b(a)}^2 + \sigma^2$
C                      r-1              $pq\sigma_c^2 + q\sigma_{ac}^2 + \sigma^2$
AC                     (p-1)(r-1)       $q\sigma_{ac}^2 + \sigma^2$
Residual               p(q-1)(r-1)      $\sigma^2$



There are actually two homogeneity assumptions implied by
this model. The first is that the source of variation due
to B(a) represents the pooled variation of observers
within teachers. The second results from the fact that the
residual term is actually the B(a) X C interaction term and
represents the pooling of different sources of variation.
The homogeneity assumption here is equivalent to the
assumption that the correlation between items is constant
within each of the teachers.



Three teachers were observed once through a one-way glass
by three different observer teams. Each observer team
contained seven members, but some of the observers were
not the same throughout all the observations and therefore
the teams were considered different.

In line with the earlier discussion of random and fixed
variables, the teacher and observer factors were considered
random factors, but because the observers were instructed
to disregard all behaviors other than those on the
observation schedule the items were fixed. Accordingly,
the $\sigma_{ac}^2$ term in the first and third lines of Table 25 was
dropped from the expected mean squares for teachers and
items, respectively. The actual and expected mean squares
for this specific situation, in which p = 3, q = 7, and r = 7, are
given in Table 26.



Table 26

Analysis of Variance of an Observation Schedule
Containing Seven Items and Using
Three Observer Teams and Three Teachers

Source of Variation                 df     E(MS)                                              Observed (MS)

A (Teachers)                        2      $49\sigma_a^2 + 7\sigma_{b(a)}^2 + \sigma^2$       $s_a^2$ = 42.05
B(a) (Observers within Teachers)    18     $7\sigma_{b(a)}^2 + \sigma^2$                      $s_{b(a)}^2$ = 6.66
C (Items)                           6      $21\sigma_c^2 + \sigma^2$                          $s_c^2$ = 340.20
AC                                  12     $7\sigma_{ac}^2 + \sigma^2$                        $s_{ac}^2$ = 56.42
Residual                            108    $\sigma^2$                                         $s^2$ = 2.90



The general set of linear equations which must be solved
to find the estimated variance components is constructed
by setting the expected mean square terms equal to their
corresponding observed mean squares. The resulting linear
equations are then solved simultaneously. Table 27 gives
the particular set of linear equations for the specific
case listed in Table 26 and the resulting estimated values
of the variances for each factor.

The three hypotheses $\sigma_a^2 = 0$, $\sigma_c^2 = 0$, $\sigma_{ac}^2 = 0$ were all
rejected because their respective F ratios,

$F_a = MS_a / MS_{b(a)} = 6.31$,

$F_c = MS_c / MS_{residual} = 117.36$,

$F_{ac} = MS_{ac} / MS_{residual} = 19.47$,

were all significant at the .01 level. The appropriate
df's are given in Table 26. The rejection of these three
hypotheses indicated that the scale does differentiate
between the teachers and the items, and that there was a
significant interaction between these two non-nested factors.
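
The three F ratios can be recomputed from the observed mean squares. The Python sketch below assumes the mean squares of Table 26 and uses the less-rounded residual value 2.8983 that appears in the reliability calculation later in this chapter; the variable names are hypothetical. Small differences in the last digit from the reported ratios are to be expected from rounding in the source.

```python
# Observed mean squares (Table 26); residual taken at four decimals.
ms_a = 42.05        # teachers
ms_b_in_a = 6.66    # observers within teachers
ms_c = 340.20       # items
ms_ac = 56.42       # teachers x items
ms_residual = 2.8983

# Teachers are tested against the nested observer term;
# items and the interaction are tested against the residual.
f_a = ms_a / ms_b_in_a
f_c = ms_c / ms_residual
f_ac = ms_ac / ms_residual
```

Testing teachers against the nested observer term rather than the residual is what distinguishes this design from a fully crossed one.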

Table 27

Estimation of Variance Components for an Observation
Schedule Containing Seven Items and Using Three
Observer Teams and Three Teachers

$\sigma_a^2$       (=)  $\frac{1}{49}(s_a^2 - s_{b(a)}^2)$   =  .72
$\sigma_{b(a)}^2$  (=)  $\frac{1}{7}(s_{b(a)}^2 - s^2)$      =  .54
$\sigma_c^2$       (=)  $\frac{1}{21}(s_c^2 - s^2)$          =  16.06
$\sigma_{ac}^2$    (=)  $\frac{1}{7}(s_{ac}^2 - s^2)$        =  7.65
$\sigma^2$         (=)  $s^2$                                =  2.90
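
Because each expected mean square in this design contains at most one component besides those already solved, the equations can be solved by successive subtraction. A minimal Python sketch follows, assuming the observed mean squares of Table 26 with p = 3 teachers, q = 7 observers, and r = 7 items (items fixed); the function and key names are hypothetical.

```python
def nested_components(ms, p=3, q=7, r=7):
    """Solve the E(MS) equations of the nested design with items fixed.

    `ms` holds observed mean squares under the illustrative keys
    a, b_in_a, c, ac, residual.
    """
    residual = ms["residual"]
    return {
        "a": (ms["a"] - ms["b_in_a"]) / (q * r),
        "b_in_a": (ms["b_in_a"] - residual) / r,
        "c": (ms["c"] - residual) / (p * q),
        "ac": (ms["ac"] - residual) / q,
        "residual": residual,
    }


observed = {"a": 42.05, "b_in_a": 6.66, "c": 340.20, "ac": 56.42, "residual": 2.90}
components = nested_components(observed)
```

The divisors 49, 7, 21, and 7 are the coefficients q*r, r, p*q, and q of the corresponding components in the expected mean squares.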



The overall reliability coefficient (Medley & Mitzel, 1963)
is equal to

$R_{XX} = \sigma_T^2 / \sigma_X^2$

Here, $\sigma_T^2 = (qr)^2 \sigma_a^2 = (7 \cdot 7)^2 \sigma_a^2 = 49^2(.7222) = 1734.00$,

and $\sigma_X^2 = qr\,(qr\,\sigma_a^2 + r\,\sigma_{b(a)}^2 + q\,\sigma_{ac}^2 + \sigma^2)$

$= (7 \cdot 7)[(7 \cdot 7)(.7222) + 7(.5376) + 7(7.6460) + 2.8983] = 4682.99.$

Therefore, $R_{XX} = 1734.00/4682.99 = .37$
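
The arithmetic above can be verified directly. The following Python sketch assumes the component values quoted in the calculation (.7222, .5376, 7.6460, and 2.8983) with q = 7 observers and r = 7 items; the function and argument names are hypothetical.

```python
def schedule_reliability(var_a, var_b_in_a, var_ac, var_residual, q=7, r=7):
    """Return (true variance, total variance, reliability) for the schedule."""
    true_var = (q * r) ** 2 * var_a
    total_var = (q * r) * (
        q * r * var_a + r * var_b_in_a + q * var_ac + var_residual
    )
    return true_var, total_var, true_var / total_var


true_var, total_var, r_xx = schedule_reliability(0.7222, 0.5376, 7.6460, 2.8983)
```

The teacher-by-item interaction carries the largest weight among the non-teacher terms in the denominator, which is the main reason the coefficient is as low as it is.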

The .37 reliability coefficient indicated that 37% of the
variance was attributable to the teacher factor and 63%
of the variance was due to the items, interactions, and
residual factors. An examination of the ratio of the
variances due to teachers and observers, the factor nested
under teachers, indicated that 21.2% and 15.8% of the
component of the total variance due to teachers was due
to teachers and observers, respectively. A similar
calculation for the other factors comprising the remaining
63% of the total variance yielded values of 38.0%, 18.1%,
and 6.9% for the items, interaction, and error or
residual terms, respectively.
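
The percentages are obtained by dividing each share (37% or 63%) among its components in proportion to their estimated variances. A hedged Python sketch of that proportional split, assuming the component estimates quoted in this chapter; the function and key names are hypothetical.

```python
def split_share(total_percent, components):
    """Divide total_percent among components in proportion to their size."""
    total = sum(components.values())
    return {name: total_percent * value / total
            for name, value in components.items()}


# The 37% attributed to teachers, split between teachers and observers.
teacher_share = split_share(37.0, {"teachers": 0.7222, "observers": 0.5376})

# The remaining 63%, split among items, interaction, and residual.
other_share = split_share(
    63.0, {"items": 16.06, "interaction": 7.6460, "residual": 2.8983}
)
```

Each split sums to the share it started from, so the five percentages together account for the whole of the variance.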



The proposed model did permit the partitioning of the
variance associated with an observational schedule into
its component parts and the calculation of an overall
reliability coefficient. In the particular case to which
the model was applied 75% of the variance was due to
teachers and items, each of these two factors contributing
equally to the total variance. Only 15.8% of the total
variance was due to observers, the factor nested under
teachers. These facts permit one to conclude that the
variance due to different observers being used was
considerably smaller than that due to the different
teachers as they were observed on the various types of
behavior represented by the items of the observational
schedule.



That the items accounted for the single largest source of
variance was probably due to the very different elements
of behavior being observed. For example, materials present
required very little judgment on the part of the observer,
while involvement of children required a great deal of
judgment.



As a result of this reliability study some confidence can
be placed in the observation schedule's ability, as used
by this observation team, to differentiate between teachers.
A well trained team might therefore be used to observe
teachers who were trained at various institutes or under
different conditions at the same institution for the
purposes of comparison. The data could then be analyzed
and if differences existed, the superiority of one method
of teacher training over another could be inferred.



The data could also have been analyzed using a repeated
measures design, as was pointed out earlier. This analysis
yields exactly the same information as that resulting from the
nested design used here. Verification is left as an
exercise for the interested reader.






CHAPTER IV 

Conclusions and Recommendations



Conclusions 

This investigation sought to develop and apply analysis of
variance techniques to the estimation of the reliability
of observation schedules.

The investigation placed special emphasis on the different
possible designs and the various administrative situations
under which they might be applied. The application of
the general model to a specific instance was then carried
out.

The study was conducted with 10 recorders who observed 
five teachers through a one-way mirror and rated them 
on an observational schedule. This procedure was 
followed for each of three of the item categories 
comprising the schedule and for the entire schedule. In 
the first instance , two teachers were observed by teams 
of seven recorders. In the second situation, three 
teachers were observed by teams of seven recorders. 

The material used in this investigation was the SUTEC
Observational Schedule, which contained seven items. The
observations for the estimation of reliability were all
carried out at SUTEC.

Analysis of the data revealed that the overall reliability
coefficient was .37 and that .72, .67, and .69 were the
reliability coefficients for the mobility, involvement,
and irrelevant acts items, respectively. When the
observer factor was treated as a fixed factor the item
reliabilities became .84, .85, and .90, respectively.
Seventy-five per cent of the variance was accounted for
by teachers and items for the overall reliability
calculation, while approximately 70% of the variance was
attributed to the teacher factor for the individual item
reliabilities.

At this juncture, it must be pointed out that the
application of the ANOVA technique to the SUTEC data does
not even exhaust the few designs described earlier.
Rather, this application was meant as an illustrative
example of the wide range of possibilities which more
accurate reliability calculations of observational data
make possible.

Once reliable observations are possible, these types of
data which may have been considered rather subjective will
no longer be avoided by research workers. The
respectability of observational data which may result has
ramifications for a number of areas such as teacher
supervision and training. However, any situation in which
observations may be used is actually an area in which
the reliability of the data may be calculated as indicated
earlier in this paper. It may therefore be possible to
utilize observational data in such disparate fields as
educational psychology, industrial psychology, and social
psychology.

The obvious areas of educational psychology such as teacher
supervision and training have been stressed throughout
this paper. Other aspects of school situations, particularly
observations of children's behaviors during the
teaching-learning situation as well as classroom and playground
social interactions, may now be studied.

Areas of industrial psychology, such as the behavior of 
workers under different conditions, market research, and 
behavior during labor disputes and labor negotiations, may 
come under more rigorous study through the use of reliable 
observations. Some social psychologists might find 
observations to be a fruitful way of studying such diverse 
phenomena as mob reactions, school disorders, and the 
behavior of juries. 

Clearly then, the work presented in this paper has 
ramifications for many fields. The application was specific 
to a pedagogical situation because the problem first came 
to the attention of the investigator in an educational 
context in which the data generated was related to an aspect 
of teacher training. 

The major conclusions, presented within the limited scope 
of this investigation, were that different variance 
components models could be applied in different situations 
to estimate the reliability of either the entire observation 
schedule or parts of it, and that the items comprising the 
SUTEC schedule did differentiate fairly well between 
teachers. 

Recommendations 

The results of the present research prompted the following 
recommendations: 



1. Construction of computer programs to analyze observational 
data gathered under the various models described herein. 

2. Extension of the models to situations in which an unequal 
number of observers, items, or situations were used without 
requiring that some of the data be randomly discarded. This 
might be possible through an unweighted-means or a 
least-squares solution. 

3. Investigation of the paradigms presented as a possible 
means of determining the homogeneity of the items comprising 
a schedule or proposed schedule. 

4. Field testing of the different models simultaneously to 
permit comparison of the results. If the differences in the 
estimated reliabilities are slight, the simplest 
administrative procedures could then be adopted as the 
standard. 
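
The second recommendation can be illustrated with a short Python sketch of an unweighted-means analysis for the simplest unbalanced case, in which each teacher is rated by a different number of observers. This is a sketch under stated assumptions rather than one of the models developed in this report: the one-way layout, the function name, and the sample ratings are hypothetical. The harmonic mean of the group sizes stands in for the common cell size, so no ratings need to be discarded.

```python
def unweighted_means_reliability(groups):
    """groups[t] is the list of ratings given to teacher t.

    The lists may differ in length (unequal numbers of observers).
    Returns the estimated reliability of a single rating under a
    one-way random model (a hypothetical layout for illustration).
    """
    a = len(groups)
    sizes = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]

    # The harmonic mean of the group sizes replaces the common cell size.
    n_h = a / sum(1.0 / n for n in sizes)

    # Between-teacher mean square from the unweighted teacher means.
    mean_of_means = sum(means) / a
    ms_between = n_h * sum((m - mean_of_means) ** 2 for m in means) / (a - 1)

    # Pooled within-teacher mean square.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_within = ss_within / (sum(sizes) - a)

    # Teacher variance component; a negative estimate is set to zero.
    var_teachers = max((ms_between - ms_within) / n_h, 0.0)
    return var_teachers / (var_teachers + ms_within)


# Hypothetical data: three teachers rated by 2, 4, and 2 observers.
groups = [[2, 4], [5, 7, 6, 6], [8, 10]]
print(round(unweighted_means_reliability(groups), 3))  # 0.876
```

A least-squares solution would weight the teacher means by their group sizes instead; comparing the two estimates on the same data is one way of carrying out the comparison suggested in the fourth recommendation.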



References 



Bloom, R., & Wilensky, J. Four observation categories for 
rating teacher behavior. Journal of Educational 
Research, 1967, 60, 464-465. 

Bobbitt, R. A., Gordon, B. N., & Jensen, G. D. Development 
and application of an observational method: Continuing 
reliability testing. Journal of Psychology, 1966, 63, 
83-88. 

Brown, B. B., Mendenhall, W., & Beaver, R. The reliability 
of observations of teacher’s classroom behavior. 
Journal of Experimental Education, 1968, 36, 1-10. 

Chapline, E. School University Teacher Education Center 
Technical Progress Report. September, 1968, Cooperative 
Research Project no. 5-0945, Contract no. OEC 1-6-050 
945-1673, United States Office of Education. 

Clifford, T. Simplified computational programming for 
analysis of variance designs with correlated observations. 
Psychological Bulletin, 1968, 69, 439-440. 

Courson, C. C. The use of inference as a research tool. 
Educational & Psychological Measurement, 1965, 25, 
1029-1038. 

Denny, D. A. Identification of teacher-classroom variables 
facilitating pupil creative growth. American Educational 
Research Journal, 1968, 5, 365-383. 

Flanders, N. A. Interaction analysis in the classroom: a 
manual for observers. Ann Arbor, Mich.: University of 
Michigan, 1960. 

Furst, N., & Amidon, E. J. Teacher pupil interaction 
patterns in the elementary school. In E. J. Amidon & 
J. B. Hough (Eds.), Interaction analysis: theory, 
research, and application. Reading, Mass.: 
Addison-Wesley, 1967. 

Katz, L. G., Peters, D. L., & Stein, N. S. Observing 
behavior in kindergarten and preschool classes. 
Childhood Education, 1968, 44, 400-405. 

Maas, J. B. Patterned scaled expectation interview: 
reliability studies of a new technique. Journal of 
Applied Psychology, 1965, 49, 431-433. 

McNemar, Q. Psychological statistics. New York: Wiley, 
1962. 






Medley, D. M. The use of orthogonal contrasts in the 
interpretation of records of verbal behaviors of 
classroom teachers. Research Memorandum RM-67-25. 
Princeton, N. J.: Educational Testing Service, 1967. 

Medley, D. M., & Mitzel, H. E. Application of analysis of 
variance to the estimation of the reliability of 
observations of teachers’ classroom behavior. Journal 
of Experimental Education, 1958, 27, 23-35. (a) 

Medley, D. M., & Mitzel, H. E. A technique for measuring 
classroom behavior. Journal of Educational Psychology, 
1958, 49, 86-92. (b) 

Medley, D. M., & Mitzel, H. E. Measuring classroom behavior 
by systematic observation. In N. L. Gage (Ed.), 
Handbook of research on teaching. Chicago: Rand 
McNally, 1963, pp. 247-328. 

Millman, J., & Glass, G. V. Rules of thumb for writing the 
Anova table. Journal of Educational Measurement, 1967, 
4, 41-51. 

Ojemann, R. H., & Snider, J. The effect of a teaching program 
in behavioral science on changes in causal behavior 
scores. Journal of Educational Research, 1964, 57, 
255-260. 

Peng, K. C. The design and analysis of scientific 
experiments. Reading, Mass.: Addison-Wesley, 1967. 

Rusch, R. R., Denny, D. A., & Ives, S. The development of 
a test of creativity in the dramatic arts: a pilot 
study. Journal of Educational Research, 1964, 57, 
250-254. 

Scott, W. A. Reliability of content analysis: the case of 
nominal scale coding. Public Opinion Quarterly, 1955, 
19, 321-325. 

Seibel, D. W. Predicting the classroom behavior of teachers. 
Journal of Experimental Education, 1967, 36, 26-32. 

Soar, R. S. An integrative approach to classroom learning. 
Philadelphia: Temple University, 1966. 

Winer, B. J. Statistical principles in experimental design. 
New York: McGraw-Hill, 1962. 

Zunich, Y. Child behaviors and parental attitudes. Journal 
of Psychology, 1966, 62, 41-46. 







APPENDIX 

School University Teacher Education Center 

P.S. 76 and Queens College Education Department 

36-36 Tenth Street 
Long Island City, N. Y. 11106 



School ________   Grade ________   Observer ________ 

Teacher ________   Date ________   Time ________ 



Developed for use in the SUTEC project, November, 1967 by 
Elaine Chapline, Ph.D. and Theodore Abramson, M.S. 



Attachment 1 



SUTEC Observation 
Number of children in class ________ 

ROOM SKETCH 

Front 

Rear 

Positions of (w) windows, (d) door(s), (td) teacher's desk, 
(CD) children's desks, in groups, (SI) special interest areas. 

Teacher mobility is indicated by marking teacher position on 
the room sketch during the second five minutes of each 
activity. Use an ordered pair to mark each position, 

i.e. 1,1 = first activity, position one 
     1,2 = first activity, position two, etc. 
     2,1 = second activity, position one 
     2,2 = second activity, position two, etc. 
     3,1 = third activity, position one, etc. 






Attachment 2 



Involvement of Children 

Activity                    Scale 

1.     1    2    3          1. Uninvolved 
2.     1    2    3          2. Moderately involved 
3.     1    2    3          3. Highly involved 
4.     1    2    3 
5.     1    2    3 
6.     1    2    3 



Attachment 3 





Materials 

Present                                                  In Use 

 1.   Chalkboard                                          1. 
 2.   Bulletin board(s)                                   2. 
 3.   Maps, charts, or pictures                           3. 
 4.   Visual Aids (films, etc.)                           4. 
 5.   Audio Aids (records, etc.)                          5. 
 6.   Text                                                6. 
 7.   Library materials, magazines                        7. 
 8.   Arts and crafts                                     8. 
 9.   Play materials (dolls, blocks, etc.)                9. 
10.   Science equipment (fish tank, etc.)                10. 
11.   Commercial supplemental materials (games,          11. 
      rex. sheets, workbooks, programmed 
      materials, etc.) 
12.   Teacher made supplemental materials                12. 










Attachment 4 

This observation of behavior should be used when the 
teacher is directing an activity for either the total 
class or a subgroup. Keep these tallies for 5 minutes, 
i.e. the third 5 minutes of an activity. Note if the 
time sample is other than 5 minutes. 



Tallies 

Categories     Activities 
               1      2      3      4      5      6 

I 

II 

III 



Definitions of Categories: 

I: A child talks or moves relevant to the activity 
without the teacher’s direction or permission. 

II: Teacher calls on child as a result of “hand raising” 
by child. 

III: Teacher calls on child without prior “hand raising” 
by child. 



Attachment 5 






Child     No. of Irrelevant Acts     Sex     Totals 

 1 
 2 
 3 
 4 
 5 
 6 
 7 
 8 
 9 
10 
11 
12 



9:45 AM 
1:45 PM 

Find the child nearest to you and observe him for 2 
minutes. Record each irrelevant act with a tally. 
Find the third child from the one just observed and 
record his irrelevant actions for a two-minute 
period. Continue until six children have been 
observed, for a total of 12 minutes. 

10:15 AM 
2:15 PM 

Continue from last child until six more children 
have been observed. Children may be repeated if it 
is their turn on the second go-around. 






